
Announcement History

Emergency Hardware Maintenance Planned for 10-a Filer
The file system which hosts 10-a has a hardware issue with its network interface card. Logins for all users may be affected, and some users (with home directories on 10-a) will not be able to log in at all. To address this issue, a system downtime is planned for March 16 from 7:30am to 9:30am to replace the card.

We regret any inconvenience this may cause. Service updates will be posted at http://icer.msu.edu/service-status

Please let us know if you need immediate assistance by filing a ticket with us at https://contact.icer.msu.edu. You can also reach us at (517) 353-9309, or stop by the iCER office.

Institute for Cyber-Enabled Research
Michigan State University
Biomedical & Physical Sciences Bldg
567 Wilson Road, Room 1440
East Lansing, MI 48824
Phone: (517) 353-9309

  

Last changed Mar 20, 2017 13:42 by Andrew R Keen
Labels: maintenance

Due to temperature issues in the HPCC machine room, the Laconia cluster was powered off.  We're investigating the issue and will post more information as it becomes available.

Update: 1:40 PM

Due to a cooling system failure and high spot temperatures in the machine room, the Laconia (intel16) cluster was powered off at 12:40 PM at the direction of Dr. Bill Punch, associate iCER director, as a precautionary measure to protect the system. Any jobs running on Laconia were terminated. We apologize to users whose work was affected.

We have resumed scheduling jobs as of 1:30 PM. Please contact us if your work was interrupted and you need special accommodations to meet a deadline.

Posted at Mar 20, 2017 by Jim Leikert | 0 comments
Last changed Mar 10, 2017 17:05 by Andrew R Keen
Labels: maintenance

From 6 AM to 3:30 PM on March 7th, the HPCC was unavailable. During the maintenance window, a number of system updates were applied to improve performance and stability. Moab was upgraded to improve performance and reliability. Our home directory file systems will now restart much more quickly and have greater network bandwidth and updated storage firmware; our identity management systems were tuned to be more robust; and the network firmware on the Laconia cluster has been updated to the newest version. Some of our large memory hosts have improved network connections to the scratch storage systems. Our firewall has been updated, and features that will improve performance have been enabled.

As part of the Moab upgrade, we've identified an issue where old data was imported; some users may have received emails saying that old jobs had been cancelled or were no longer available in the queue. We've identified the source of these errors (leftover checkpoint and caching data) and are removing it from the system. Most of it has been removed; if you have any problems with the scheduler or any other part of the system, please let us know.

We also identified an issue where users may have received a "Stale File Handle" message when accessing their home or research space. We've fixed this issue.

Posted at Mar 10, 2017 by Andrew R Keen | 0 comments
Warning: data loss is possible without action
We are changing how we handle old files on our Lustre file system.  As of February 20th, any files older than 45 days on this file system (/mnt/ls15/scratch) will be permanently deleted.  Until now we have purged files older than 45 days but retained copies of them; we are no longer able to do so.
The 'scratch' file system is for temporary working files only; it is not a file storage solution.  You must copy any files you wish to keep from scratch to your home directory or shared research folder, or transfer them to an external system; otherwise those files will be lost and cannot be recovered.
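For example, a minimal way to do this from a development node, assuming your scratch directory is /mnt/ls15/scratch/$USER and using a placeholder directory name:

    # Copy a directory you want to keep from scratch into your home directory.
    # The directory name my_results is a placeholder; adjust the paths to your own files.
    rsync -av /mnt/ls15/scratch/$USER/my_results ~/my_results
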
The size of scratch prohibits backups, while your home directory has snapshots and offsite backups.  You may request up to 1 TB of home directory space for free.  See https://wiki.hpcc.msu.edu/display/hpccdocs/HPCC+File+Systems+Overview
Additional storage space on home or shared research folders is available for a reasonable annual cost: https://contact.icer.msu.edu/large_quota
Posted at Feb 07, 2017 by Brian Roth | 0 comments
Last changed Jan 31, 2017 13:19 by Andrew R Keen
The Intel compiler suite, including VTune, is unavailable at this time as we are in the process of renewing the license.  We expect the process to be completed by the end of the week - Friday, January 25, 2017.

The license was updated on January 24th and the Intel compiler suite is now available.
Posted at Jan 25, 2017 by Charles D Miller | 0 comments
Last changed Jan 19, 2017 02:11 by Andrew R Keen
Labels: maintenance

At 3:55 PM, the chilled water unit that cools the HPCC facility failed. We paused the scheduler at 4:05 PM and had to shut down about 100 nodes at 4:23 PM to prevent overheating. Some users' jobs may have been terminated at that time; most jobs have continued to run. The chiller has been reset and we are continuing to monitor its operation; once we are confident that temperatures are stable, we will resume scheduling new jobs.

Update: 2 AM. Maintenance has been completed and the system has resumed scheduling jobs. The current indication is that a bad water temperature sensor caused the chiller to shut down.

Posted at Jan 18, 2017 by Andrew R Keen | 0 comments

ITS has scheduled work on the campus authentication service between 3:30 and 4 PM today, December 29th, 2016. During this time, users may not be able to authenticate to any HPCC system with their MSU NetID usernames and passwords. Active sessions should not be interrupted, so users are advised to log in before the maintenance window and keep open any sessions they may need. Access to the contact forms and the ticketing system should not be impacted.

Posted at Dec 29, 2016 by Andrew R Keen | 0 comments

We've had a number of issues with our Lustre file system (https://wiki.hpcc.msu.edu/x/JIC0AQ).

If you get error messages related to the Lustre file system (scratch), please use a temporary directory instead.

Please refer to https://wiki.hpcc.msu.edu/x/koC0AQ for details.
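
As an illustration, here is a minimal sketch of a TORQUE job script that works in a node-local temporary directory instead of scratch. The use of $TMPDIR (falling back to /tmp) and the program and file names are assumptions for this example; see the wiki page above for the recommended approach:

    #!/bin/bash
    #PBS -l walltime=01:00:00,nodes=1:ppn=1
    # Work in a node-local temporary directory instead of the Lustre scratch system.
    # $TMPDIR (or /tmp) is assumed to point at node-local storage; my_program and input.dat are placeholders.
    cd ${TMPDIR:-/tmp}
    cp ~/input.dat .
    ~/my_program input.dat > output.dat
    # Copy the results back to your home directory before the job ends.
    cp output.dat ~/results/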

Posted at Dec 21, 2016 by Yongjun Choi | 0 comments
Last changed Dec 22, 2016 11:39 by Xiaoxing Han
Labels: maintenance

Dear HPCC Users,

As you are all probably aware, ITS made a change to its firewall rules last Tuesday, 12/20/16, that blocked SSH access to campus. The SSH protocol is fundamental to the operation of the HPCC: it provides remote login, file transfer, and access to various other services. Unfortunately, ITS did not announce this change to the HPCC or to anyone else on campus. Thus we, along with every other facility on campus, were caught unaware when no one off campus could log in. This blockage remains in effect generally for all of campus as of Thursday, 12/22/16.

ITS reopened SSH access to the HPCC gateway, gateway.hpcc.msu.edu (35.9.12.10), on Wednesday, 12/21/16, after Vice President Hsu agreed to take responsibility for security matters at the HPCC for 30 days. In those 30 days, we will revisit with ITS the impact of this change on the center and what actions should have been taken with regard to security at the HPCC. I'm sure others, both individually and as a group, will be doing the same.

We at HPCC have not been notified of any increased security risks to the center or any specific threats to our systems. We will continue to be proactive to ensure the security of the HPCC systems and data.

Finally, if you get locked out again, we suggest using the MSU virtual private network by visiting https://vpn.msu.edu. When you log in and hit the "Start" button, a VPN tunnel is created that should allow you remote access. Not everyone has access to the VPN (undergraduates, for example, are excluded), but many will be able to use it if SSH ports get blocked again.

We will try to keep you updated on the situation on this wiki page.

Sincerely,
Bill Punch
HPC director
Assoc. Prof CSE

 

 

 

Posted at Dec 20, 2016 by Patrick S Bills | 0 comments
Last changed Jan 04, 2017 23:27 by Andrew R Keen

Our next regular scheduled maintenance window will be on Wednesday January 4th. We anticipate that all HPCC services will be offline all day. Progress updates will be posted on the HPCC Wiki and ICER social media.

The current targeted work includes power and cooling maintenance, network upgrades, storage system updates and testing, firmware updates, and minor compute system software updates. A set of systems will be made available next week for users to test.

This maintenance window was previously announced as January 3rd; it has been moved to accommodate the new year holiday.

Please contact us at https://contact.icer.msu.edu/contact if you have any questions or concerns.

Update: 8:45 AM. The power maintenance has been completed. 35% of the compute nodes have been reimaged. Network firmware updates are underway.

Update: 5:20 PM. The OS updates have been completed on the development nodes (excluding dev-intel16-k80) and most of the compute nodes. Home directory server work is complete; scratch is still undergoing maintenance. Gateway and the development nodes are available to users.

Update: 8 PM. The server component of the scratch maintenance has been completed and most nodes have been reimaged. We are completing final checks on scratch and anticipate a return to service by 10 PM.

Update: 11 PM The scheduler has resumed and jobs are running as normal. Most nodes have been returned to service. The remainder will return tomorrow. All services should be operating normally. Please contact us with any issues.

Posted at Dec 15, 2016 by Andrew R Keen | 0 comments
Last changed Dec 16, 2016 11:00 by Patrick S Bills
Labels: maintenance

The current weather forecast for East Lansing predicts that temperatures will fall below 0 degrees Fahrenheit between 5 AM and 10 AM on Monday, December 18th. Because of a design issue, if our cooling system stops, it will not restart until temperatures rise above 0 degrees, which would cause the HPC computers to overheat. As a precaution, we have blocked any new jobs that would run during those hours to reduce the amount of work lost if an emergency shutdown is required.

We will continue to monitor and may cancel the block if the forecast changes. Please let us know if you have any issues.

MSU's new energy-efficient data center is currently under construction and will resolve these issues. Construction on the data center is scheduled to complete in 2018.

Please contact us at https://contact.icer.msu.edu/contact if you have any questions or concerns.  

Note: we are also continuing to work through

Posted at Dec 15, 2016 by Andrew R Keen | 0 comments
Labels: maintenance

We've had a number of issues with our Lustre file system (/mnt/ls15 or /mnt/scratch) over the past few months. We've put together this blog post to summarize them. If you are currently experiencing problems that you suspect are related to the scratch file system, please open a ticket with us.

  • Metadata server full. On November 21st, the 270 million file capacity of Lustre was reached. Due to a bug in Lustre with ZFS, it is not possible to delete files on a full file system. We were able to bring additional capacity online to delete files and return to service that afternoon. We have implemented a quota, as previously announced in August. We are also currently purging any files older than 45 days as previously announced.
  • Quota: Your current quota on scratch/ls15 is the number of files you have on scratch + 1 million. The previously announced quota of 1 million files will be phased in over the next few weeks; affected users will be notified by email before enforcement happens. We will also begin phasing in a 50 TB quota on scratch/ls15. Home directories and research spaces are unaffected.
  • Misreporting quota. About 1/8th of our users have incorrect quota reporting due to a Lustre bug. Please contact us if you get Disk Quota Exceeded errors or are unable to access scratch. You can check your quota with the command:
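    lfs quota -u $USER /mnt/ls15     # assumed form of the check; the exact path or options may differ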

    from a development node. If you see that you have 16 EB of data or 16×10^18 files in use, or see an * in the output, please contact us.

  • Metadata performance: We have had multiple issues in the past few months where users submit many hundreds of jobs that generate and delete tens of thousands of files per minute. These operations are very expensive; Lustre is not optimized for small-file IO, and the file system can become unresponsive to other users while this is happening. Files on scratch should be at least 1 MB each; for optimum performance, at least 1 GB per file. Status: We continue to work with users to educate them and identify problematic workloads.
  • Intel14 fabric links. There were a few missing links on the Intel14 fabric that were degrading performance. They have been replaced.
  • Intel14 firmware update. We are tracking a communication error on the Intel14 cluster, which is running an older version of the InfiniBand network firmware. We are in the process of updating these nodes.
  • No space left on device or Input/Output error. We believe that these errors are caused by communication issues on the network fabric. Early testing has shown improvements with the firmware update mentioned in the previous bullet point. If you see these errors on scratch, please contact us.
  • Missing files? Some users have reported messages from noreply@hpcc.msu.edu with the subject line "PBS JOB 31234567.mgr04" containing messages like:

    or

    These can indicate a problem with scratch or with TORQUE, Moab, the home directory system, or the nodes your job was running on. Please examine your job script and program outputs for more specific error messages and let us know if you see this issue.

  • 2TB file size limit: If you are attempting to write more than 2 TB to a single file, you will need to adjust the stripe setting of the file or directory to spread it across multiple storage targets using the lfs setstripe command (see the example after this list). Please contact us for more information.
  • Problem with dev-intel14. Users have reported IO errors on dev-intel14 specifically. Please report the problem and try another development node if you continue to have an issue.
  • Intel16: Lustre performance. Until our August maintenance window, Lustre was running over the Ethernet interfaces on the Intel16 cluster. Single node Lustre performance suffered. This has been resolved.
  • Intel16: Fabric Unbalanced. Due to a physical limitation, the Lustre storage servers were not balanced evenly throughout the intel16 fabric. At times of high traffic, the Lustre servers would lose communication with the intel16 cluster. The fabric has been rebalanced and we are no longer seeing traffic contention on that switch.
  • Failover configuration after update. During our August maintenance window, we identified a bug in the Lustre software that prevented high availability features from functioning. This was patched in September.
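
As an illustration of the striping change mentioned in the 2TB item above, here is a minimal sketch using lfs setstripe; the directory path and the stripe count of 8 are placeholder values for this example:

    # Stripe new files created in this directory across 8 storage targets (OSTs)
    # so that a single file can grow beyond the single-target 2 TB limit.
    # The path and stripe count are example values only; adjust them for your data.
    lfs setstripe -c 8 /mnt/ls15/scratch/$USER/large_output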

Posted at Dec 02, 2016 by Andrew R Keen | 0 comments
Last changed Nov 22, 2016 22:25 by Patrick S Bills
Labels: maintenance

Update:  Maintenance complete and all file services are restored.   We are monitoring file systems carefully.   Please contact us with any concerns.  

Update:  Systems will remain off-line for an additional hour from the original notice.  We are planning to return to service at 4:30 pm today.  We are sorry for the inconvenience and thank you for your patience. 

File systems are currently off-line and users are unable to log in, connect for drive mapping/file sharing, use remote desktop or file transfer, or use Globus services, per our previous announcement at https://wiki.hpcc.msu.edu/x/yAKpAQ

Posted at Nov 22, 2016 by Patrick S Bills | 0 comments
Last changed Nov 21, 2016 16:34 by Patrick S Bills

Update 4:00pm : SCRATCH service has been restored.  

The Lustre/SCRATCH system is currently unavailable while we mitigate a problem with the metadata server, which has filled unexpectedly.   An unfortunate symptom is that users are not able to read, write, or delete files at this time.   We are sorry for the interruption and will update this when the problem is solved.   Thank you for your patience.  

Posted at Nov 21, 2016 by Patrick S Bills | 0 comments
Last changed Nov 22, 2016 14:52 by Jim Leikert
Labels: maintenance

UPDATE :  The maintenance has been extended until 4:30pm EST, 11/22/2016

NOTICE: All file systems will be closed during this outage and users will not be able to log in during this time.

The HPCC will be conducting system maintenance on the User File Servers next Tuesday, November 22, 2016, from 12:30pm until 3:30pm (all times Eastern).

This maintenance window is needed to address a system stability issue with these filers. 

User jobs that can complete before the maintenance window will continue to run.  User jobs that will cross into the maintenance window will be held until after the maintenance window.  Job scheduling will return to normal after the maintenance window.

Users will not have access to the system during this maintenance window.  This restriction is unfortunately needed to ensure stability during the maintenance.  

Please contact us at https://contact.icer.msu.edu/contact if you have any questions or concerns.  

Posted at Nov 16, 2016 by Anthony Parker | 0 comments
Last changed Nov 16, 2016 09:07 by Anthony Parker
Labels: maintenance

A failed hardware component on ufs-12-a took it offline at 10:45 PM on 11/14/2016. At about 1 AM on 11/15, ufs-12-b attempted to take over but shut itself down due to a perceived hardware issue. Users with home directories or research spaces on ufs-12-a and ufs-12-b will be unable to access them until service is restored.

UPDATE 16 NOV 09:06am  Access to filesystems hosted on ufs-12-a & b has been restored.   

Posted at Nov 15, 2016 by Andrew R Keen | 0 comments

 
