Blog

This morning an incident on Ufs-11-a caused the data store to move to Ufs-11-b.  We will be restoring the data store to its normal operating state.  Tomorrow, Friday June 15th, at 8 am, home directories and research spaces on these filers will be temporarily unavailable for approximately 30 to 45 minutes.


Update: 8:55 am Restoration of high availability completed at 8:15 am.  As a result of the restoration, all gateways needed to be restarted.  gateway-03 is currently in a failed state and we are attempting to restore it.



Scratch unavailability

Currently a portion of scratch is unavailable.  We are working with the vendor to restore functionality.  Users may see issues while attempting to access files.  We will provide further updates when available.


Update:  As of 5:30 pm yesterday evening, the downed storage on scratch has been restored.

New web-based remote desktop


We have recently deployed a new web-based remote desktop server.  This server allows users to connect via a web browser without any additional configuration of their systems. We will be deactivating the current remote desktop server on Monday, June 25th. Please be sure to try logging in to the new web-based server and contact us with any questions or concerns. For information on how to connect, please see the following link: https://wiki.hpcc.msu.edu/display/ITH/Connecting+with+a+Web-based+Remote+Desktop

April 30, 2018  RT is now up and running.

April 30, 2018  The RT web interface and email are currently down.

---


April 19, 2018   MSU IT Services' fix has been implemented and email has been restored.

April 18, 2018   MSU IT Services has a fix in the works and has said it will be put into place at 6 am EDT on April 19.

April 17, 2018   Email access is NOT restored

Email to our RT tracking/help desk system (RT) is currently not getting through.  You may still contact us via http://contact.icer.msu.edu, and you can read and respond to your current issues (tickets) by logging into https://rt.hpcc.msu.edu/.  Unfortunately, if you reply via email we will not receive it; you must use https://rt.hpcc.msu.edu/ to reply.  The RT update requires a change in MSU network security to be completed.  MSU IT Services has been contacted and we anticipate a fix soon.  We apologize for this inconvenience.


---
1:20 PM Email access on campus has been restored.  UPDATE: The MSU network firewall is blocking access, so email is effectively not restored.

1:00 PM Currently, opening an RT ticket by sending an email does not work; however, accessing rt.hpcc.msu.edu works, as does opening a ticket via https://contact.icer.msu.edu.

On April 17, 2018 the RT ticketing system will be down for maintenance between 5:30am and 8:30am EDT.



On Monday, April 16th at 8 am we will be taking the Globus / rsync gateway offline for an update to the Lustre (Scratch) file system.  The window will last approximately one hour, during which rsync.hpcc.msu.edu and Globus will be unavailable.


4-16-18 Maintenance was completed at 8:40am


The main ICER website http://www.icer.msu.edu  will be unavailable at 2 PM today for emergency maintenance. All other ICER services are unaffected.

ECC Enabled on all K80 GPUs

The ECC feature on all of iCER's Tesla K80 GPUs has been enabled. This change will take effect after the upcoming scheduled maintenance outage.

The HPCC will be unavailable on March 6th for security and stability software updates.

Users should note that the latest security patches may have a significant performance impact on their code, particularly for applications that work with files. Please contact us if you are impacted by this and we can explore what can be done to mitigate the impact.


Update: 4:45 PM All maintenance was completed by 4:35 PM and job scheduling has resumed. Users may experience a delay accessing home directories and research spaces on ufs-11 in the next hour while one remaining maintenance item is addressed.

Update: 6 PM ufs-11 has been returned to production service.



On 1-20-18 a single drive failed on Ufs-13-a, which caused other drives on the same controller to become unresponsive.  The file system was unresponsive for approximately an hour before our high availability system moved the data pool over to Ufs-13-b.  As a result, a very small number of files, approximately 20, that were being written at the time of the failure were lost.  At this time, home directories on ufs-13-a cannot be mounted on your local system unless you are using SSHFS.  Included below is a link on mounting a drive with SSHFS.

We will be working with our vendor to investigate what caused the failure and how this can be avoided in the future.  


On Tuesday January 30th at 7am we will be taking the data pool offline for both Ufs-13-a and Ufs-13-b to resume normal operations.

This maintenance will last about one hour, during which all users and research spaces on these storage systems will be unavailable.


Update:  Both ufs-13-a and ufs-13-b were taken offline to restore complete functionality at 7:15 am and were restored to online status at 7:45 am



Mapping HPC drives with SSHFS
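
As a quick sketch, mounting a home directory with SSHFS from a Linux or macOS terminal looks roughly like the commands below. The gateway hostname, remote path, and local mount point are examples only; substitute the values given on the SSHFS wiki page above.

mkdir -p ~/hpcc_home
sshfs yournetid@hpcc.msu.edu:/mnt/home/yournetid ~/hpcc_home
# When finished, unmount with: fusermount -u ~/hpcc_home (Linux) or umount ~/hpcc_home (macOS)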

Samba Update

We have now resolved the issues with Samba on ufs-12-a and ufs-12-b.  This service should now be available to all users.  If you are having trouble connecting, please see our Samba wiki below.


Mapping HPC drives with Samba
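
As a rough illustration, mapping a home directory over Samba from a Linux desktop could look like the following. The server name, share name, and mount point are placeholders for this example; the Samba wiki above lists the correct values to use.

sudo mkdir -p /mnt/hpcc_home
sudo mount -t cifs //ufs-12-a.hpcc.msu.edu/yournetid /mnt/hpcc_home -o username=yournetid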

Due to a system problem, one of the components of the Lustre (ls15) scratch file system is currently offline. We are currently working with the vendor to restore access to the full system. Access to files on the affected server will hang until access has been restored. The scheduler has been paused until the issue is resolved.

Update: 6 PM: The vendor is continuing to diagnose the problem. The scheduler has been resumed. Users can see which of their files are on the affected OST by using the following command:

lfs find $SCRATCH -obd ls15-OST0012_UUID
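
This command walks your scratch directory and prints only the files whose data is stored on the affected object storage target (OST0012); files on other OSTs are unaffected and remain accessible.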


Update 12:45pm 12-19:  We continue to work with our Lustre vendor to diagnose the current issue with /mnt/ls15.  


Update 9am 12-21:  We are currently cleaning up lost file data from the Lustre failure.  We anticipate that Lustre operations will likely return to normal late this afternoon.


Update 4pm 12-21:  All user and group scratch directories have been cleaned of lost files and should be responding properly.  Please open a ticket if you are still having trouble with your scratch directory. 

The HPCC will be unavailable on January 4th, 2018 to perform scheduled maintenance on core systems.

Update: 4:30 PM. The maintenance has been completed and job scheduling has resumed. There is one outstanding issue: users with home directories or research spaces on ufs-12 will not be able to access them via SMB from their desktops. Users can use SSHFS as described here to access their home directory or research space on the ufs-12 server until service is restored.

We have disabled the SMB version 1 (SMBv1) protocol on our filers running Red Hat Enterprise Linux 7 (ufs-11-a, ufs-11-b, ufs-13-a and ufs-13-b) due to a critical vulnerability; SMBv2 and SMBv3 remain available. SMB is the protocol that allows you to access your home directory or research space on campus from your desktop or laptop. In our testing, all modern operating systems support SMBv2 or v3 unless it has been explicitly disabled; check our documentation for instructions on how to enable SMBv3 and disable SMBv1. If you are having trouble accessing your home directory or research space on one of these systems after this change, please open a ticket.
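
For example, on a Linux client the protocol version can be requested explicitly at mount time with the vers option; the server and share names below are placeholders, so check our documentation for the correct ones. With SMBv1 disabled on the filers, a mount requesting vers=1.0 will now be refused, while vers=2.1 or vers=3.0 will work.

sudo mount -t cifs //ufs-11-a.hpcc.msu.edu/yournetid /mnt/hpcc_home -o username=yournetid,vers=3.0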

HPCC systems will be unavailable on Tuesday, September 19th from 7 AM to 7 PM to improve the stability of our home directory storage system. We apologize for the interruption in your work.

Update: The work was completed at 8:25 PM and the system was returned to service. Users with research spaces on ufs-11-b may experience intermittent permissions issues. This will be resolved Wednesday morning.

Due to a bug currently in ZFS, we have experienced several instances of our home directory filers going offline.  This is caused when a user restores files from a snapshot: ZFS does not unmount the snapshot properly, and when our auto-snapshot process runs it can cause a kernel panic, bringing the system offline.  To mitigate this we will be disabling access to the .zfs directory for all user home directories and research spaces.  If you need files restored, please open a ticket.  We apologize for the inconvenience; however, this will help improve system stability.  When a viable fix for the issue is released by ZFS, we will allow access to the snapshots once again.
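
For context, self-service restores previously worked by copying files straight out of the hidden snapshot directory, roughly as in the sketch below (the snapshot name is invented for illustration). This is the access path that is now disabled, so such restores must go through a ticket instead.

ls ~/.zfs/snapshot/
cp ~/.zfs/snapshot/daily-2018-01-15/lost_file.txt ~/lost_file.txt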