Blog

HPCC systems will be unavailable on Tuesday, September 19th, from 7 AM to 7 PM, to improve the stability of our home directory storage system. We apologize for the interruption to your work.

Update: The work was completed at 8:25 PM and the system was returned to service. Users with research spaces on ufs-11-b may experience intermittent permissions issues. This will be resolved Wednesday morning.

Due to a bug in ZFS, we have experienced several instances of our home directory filers going offline. The problem is triggered when a user restores files from a snapshot: ZFS does not unmount the snapshot properly, and when our automatic snapshot process runs it can cause a kernel panic that takes the system offline. To mitigate this, we will be disabling access to the .zfs directory for all user home directories and research spaces. If you need files restored, please open a ticket. We apologize for the inconvenience; this change will help improve system stability. When a viable fix for the issue is released for ZFS, we will re-enable access to the snapshots.
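
For reference, self-service restores previously worked by copying files directly out of the snapshot directory. The sketch below shows roughly what that looked like; the snapshot name and file paths are illustrative examples, not exact HPCC paths.

    # List the snapshots available for your home directory (illustrative paths).
    ls ~/.zfs/snapshot/

    # Copy a single file out of a daily snapshot back into the live directory.
    # The snapshot name and file path below are examples only.
    cp ~/.zfs/snapshot/zfs-auto-snap_daily-2017-09-01/results/data.csv ~/results/data.csv

With .zfs access disabled, this path will no longer be reachable; please open a ticket instead.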

Update, 11:27 AM: Issues with the scheduling system have been resolved and service has been restored. Some compute nodes have remaining issues that are being addressed.

As of approximately 10 AM, the HPC scheduling software is offline for emergency maintenance. We are actively working to restore service as soon as possible. Users can log in, but cannot schedule new jobs or monitor existing jobs until service is restored. As always, thank you for your patience.

Ufs-13-a Outage

Unrelated to yesterday's trouble with ufs-13-b, ufs-13-a experienced a kernel panic that made it unavailable from about 10 PM until this morning. We are looking into ways of mitigating this issue. Again, we apologize for the inconvenience.

Ufs-13-b Outage


At about 4 PM today we attempted to remove an old snapshot, which caused the ufs-13-b filer to become unresponsive for approximately 40 minutes. The filer has been rebooted and is currently back online. We apologize for any inconvenience.

Users and research spaces affected by the failure of the 11-a filer were migrated to the 13-a and 13-b filers. Accessing data when logged into HPCC systems is unchanged; however, this migration will affect how data is accessed using Samba.

If you are having difficulty accessing your data after the August 8th downtime, please refer to our wiki: Mapping HPC drives to a campus computer with SMB.

Due to a catastrophic error on the file server ufs-11-a, research spaces and home directories on that file server are currently unavailable. All data is safe and we are restoring from a current backup, but access to those spaces has been blocked until they have been restored. We will send affected users an email once their data has been restored and they have been unblocked.

We sincerely apologize for the interruption. Please let us know if this will negatively impact your work and we will see what we can do to ensure that you can complete your work in a timely manner.

Update: 8/11 5 PM: Most home directories (864 of 877) and research spaces (33 of 39) that were on ufs-11-a have been restored.

Update: 8/14 11 AM: All but one user (876 of 877) and research space (38 of 39) that were on ufs-11-a have been restored. About 300 TB of data has been restored from the offsite systems.

Update: 8/15 11 AM: All users and research spaces that were on ufs-11-a have been restored. Please contact us if you have any issues.

Scheduler Offline (Resolved)

At approximately 10 AM this morning our scheduler crashed and is currently offline. We are looking into the issue and will bring the scheduler back online as soon as possible.

UPDATE: 11:30 AM The job manager (TORQUE) failed to restart due to a large number of job array files that were left behind from jobs that had completed successfully. Staff identified the problem and were able to remove the ~130,000 leftover job files and restart the server at approximately 11:10 AM.
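
For context, TORQUE keeps per-job records in its server spool directory, and a buildup of stale files there can prevent the server from restarting cleanly. The commands below are a hedged sketch of how such leftovers can be counted; the spool path reflects a default TORQUE installation and is an assumption, not a record of the exact procedure our staff used.

    # Count job files in the TORQUE server spool (default path; adjust as needed).
    find /var/spool/torque/server_priv/jobs -type f | wc -l

    # Preview files untouched for a week that may belong to jobs that already
    # completed. Review carefully before removing anything; illustrative only.
    find /var/spool/torque/server_priv/jobs -type f -mtime +7 | head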

All HPCC systems will be unavailable from Tuesday, August 8th from 6 AM to 6 PM. The primary focus will be on improving the performance and stability of our storage systems, as well as general system updates. Users should note the following changes:

  • New SMB configuration for Windows File Sharing / Home Directory mounting (Samba). We'll be changing the authentication domain from HPCC to CAMPUSAD and raising the minimum version of SMB required to connect to version 2. Please see the following link for updated mounting information; a sample mount command is also shown after this list.

                  Mapping HPC drives to a campus computer with SMB  

  • New gateway servers. We will be deploying additional gateway nodes to provide a better user experience and minimize disruption when one server has an issue. Users may notice that access to gateway.hpcc.msu.edu may report a different host name once they connect. Users can connect to a specific gateway if desired, but it is not recommended.
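
For example, a Linux client could map a home directory with the new settings as follows. This is a minimal sketch: the file server name, share name, and mount point are placeholders, so please use the wiki page above for the actual values for your account.

    # Mount an HPCC home directory over SMB using the new CAMPUSAD domain and
    # SMB protocol version 2 (server, share, and mount point are placeholders).
    sudo mount -t cifs //FILESERVER.hpcc.msu.edu/USERNAME /mnt/hpcc-home \
        -o username=USERNAME,domain=CAMPUSAD,vers=2.0
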
Ufs-12-a Maintenance

On 7-20-17 at approximately 4:45 PM, ufs-12-a experienced a short outage due to excessive network traffic. The data is currently being hosted on the ufs-12-b server due to a high-availability failover. We will be performing maintenance on Monday, 7-24, at 7 AM EDT to restore normal functionality. During this time the ufs-12-a and ufs-12-b filers may experience an outage of up to 45 minutes.


Update, 8 AM: Maintenance is running longer than expected. The file system should be back online within a half hour.

HPCC Job Status E-Mails

In an effort to reduce the volume of email the MSU HPCC sends to users, we've recently turned on a feature that combines job status emails sent within one minute into a single message. When using the mail setting in a job script (i.e., #PBS -m <options>), each individual job will no longer send its own email. Instead, jobs will be grouped together and one 'summary' email will be sent per minute for that group of jobs. If you use this output to track your job status, you'll need to rework your email scripts to handle the summary format, or use the following as the last command in your job scripts to ensure that each job sends an email upon completion:

echo `qstat -f $PBS_JOBID` | mail -s "PBS JOB $PBS_JOBID" $USER@msu.edu
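
For example, a job script using this approach might look like the following. This is a minimal sketch: the resource requests, job name, and run_analysis.sh command are placeholders for your actual workload.

    #!/bin/bash
    #PBS -l walltime=01:00:00,nodes=1:ppn=1
    #PBS -N example_job

    cd $PBS_O_WORKDIR
    ./run_analysis.sh    # placeholder for the real work

    # Last command: send a per-job completion email despite the summary batching.
    echo `qstat -f $PBS_JOBID` | mail -s "PBS JOB $PBS_JOBID" $USER@msu.edu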

On Monday, July 10th at 8 AM we will be performing high-availability maintenance on ufs-11-a and ufs-11-b. Today we experienced an instance of high load that falsely caused the system to mark the file system as down. We will be clearing this error and do not anticipate any downtime for this system. However, as a precaution, please be aware that these units could become unavailable for approximately 20 minutes starting at 8 AM.


Update: As of 8:30 AM all file systems are back online and working as expected. The ufs-11-a and ufs-11-b filers did require a reboot to clear a false error triggered on Friday and were unavailable for about 25 minutes.


Our next regularly scheduled maintenance window will be on Tuesday, May 16th from 6:00 AM to 5:00 PM. Progress updates will be posted on the HPCC Wiki and ICER social media.

Targeted work includes updates to the Torque resource manager and general system stability improvements.

Please contact us at https://contact.icer.msu.edu/contact if you have any questions or concerns.

UPDATE 11:10 AM - We have completed the file server and gateway work, and access has been restored to the gateway and development nodes. The TORQUE server upgrade has been completed and is being deployed and tested on the cluster.

UPDATE 2:10 PM - We have completed all of the work and scheduling has resumed. Please let us know if you have any issues with the system.

ufs-11-a experienced a hardware failure last night around 10 PM. An automated process failed to recover the system. This caused gateway to become unresponsive to new SSH connections until this morning. It was returned to service at 8:30 AM.

Due to system upgrades, RT and contact forms will be intermittently unavailable today from 10:30 AM to 5 PM. Users needing to contact ICER for technical support should use web support:

https://www.hipchat.com/gYlrQfgah

UPDATE: 4:30 PM. The work on contact and RT was completed at about 1:30 PM today.