
Blog

At 7:00AM on Friday, December 14th, file systems served by the UFS-11-a and UFS-11-b filers will be unavailable while fault tolerance is restored. Restoration is estimated to take half an hour.

At 7:00AM on Thursday, December 13th, file-systems on ufs-12-a will be unavailable while server redundancy is restored. Restoration is estimated to take approximately 30 minutes.

On Friday, December 7th, we encountered an issue with the ufs-12-a and ufs-12-b filers that affected their availability between 2:30PM and 4:00PM. This issue has been resolved.

In November 2018, we installed a new, faster, and more reliable scratch system based on the IBM General Parallel File System (GPFS). The new scratch system is housed in the new MSU data center near our clusters and has a high-speed network connection. Our previous scratch system, installed in 2015 and based on the Lustre file system (LFS), has a much slower connection to the clusters.

As of December 2018, both scratch systems are available to you at two different paths. However, only the new scratch space (gs18) is connected via InfiniBand.

On December 4th, the variable $SCRATCH and the shortcut /mnt/scratch/$USER were changed to point to the new scratch system, gs18:

old scratch space: /mnt/ls15/scratch/users/$USER (or use the variable $SC15)

new scratch space: /mnt/gs18/scratch/users/$USER (or use the variable $SC18 or $SCRATCH), or the shortcut /mnt/scratch/$USER

where $USER is the environment variable containing your HPCC user id.
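You can quickly confirm which system your environment points to by printing these variables from a dev node, for example:

$ echo $SCRATCH    # should now print /mnt/gs18/scratch/users/<your user id>
$ echo $SC15       # old (ls15) scratch path
$ echo $SC18       # new (gs18) scratch path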

Please note that the gateway rsync.hpcc.msu.edu can only access the old scratch space. If you use the gateway to transfer files to scratch, they can only be saved to the old system (ls15). You will then need to copy them to the new one (gs18) through a dev node.
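As a rough sketch of that two-step workflow (here <netid> is your HPCC user id and mydata is just a placeholder directory; the exact command you run from your own computer may differ):

# step 1, from your own computer: transfer files to the old scratch through the gateway
$ rsync -av mydata/ <netid>@rsync.hpcc.msu.edu:/mnt/ls15/scratch/users/<netid>/mydata/

# step 2, after logging in to a dev node: copy them over to the new scratch
$ rsync -av /mnt/ls15/scratch/users/$USER/mydata/ /mnt/gs18/scratch/users/$USER/mydata/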


Transition from ls15 to gs18

ICER is not moving files from the old system to the new one on users' behalf; users are responsible for doing that themselves. We recommend transitioning from the old scratch space to the new one as soon as you can. You can copy the necessary files from ls15 to gs18 with the 'cp' command:

$ cp -av /mnt/ls15/scratch/users/$USER/. /mnt/gs18/scratch/users/$USER    # the trailing /. copies the contents of your ls15 directory into your gs18 directory

or with the 'rsync' command:

$ rsync -av /mnt/ls15/scratch/users/$USER/ /mnt/gs18/scratch/users/$USER/

rsync can be restarted and won't recopy files that have already been transferred if you have to stop it, or if it fails or is interrupted during the copy.
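If you would like to preview a long transfer before starting it, or watch its progress while it runs, rsync's standard options can help (optional; shown only as a suggestion):

$ rsync -avn /mnt/ls15/scratch/users/$USER/ /mnt/gs18/scratch/users/$USER/    # -n (dry run): list what would be copied without copying anything
$ rsync -av --progress /mnt/ls15/scratch/users/$USER/ /mnt/gs18/scratch/users/$USER/    # show per-file progress during the real copy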

You can do this from any dev node. If you have many files, it will take a long time, so please plan to keep your connection open during that time or use a virtual terminal (such as screen). We are happy to help if you have any concerns or difficulty. Please contact us, or visit us during office hours, Mondays 1-2pm or Thursdays 1-2pm, in the 1440 BPS building.
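For example, one way to keep a long copy running after you disconnect is to start it inside a named screen session on a dev node (a sketch; the session name "copy" is arbitrary, and you must reattach from the same dev node):

$ screen -S copy    # start a named screen session
$ rsync -av /mnt/ls15/scratch/users/$USER/ /mnt/gs18/scratch/users/$USER/
# detach with Ctrl-a d; the copy keeps running while you are logged out
$ screen -r copy    # reattach later to check on it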


Temporary Workaround

If your scripts depend on the $SCRATCH variable pointing to the old scratch space, you can change it back to the old value as an immediate measure:

$ export SCRATCH=/mnt/ls15/scratch/users/$USER

We don't recommend this. (We recommend copying your files to gs18, or using $SC15 instead.) However, if your transition to gs18 takes a long time, you could add this command to your scripts or run it when you log in.
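For example, a job script that still expects the old layout could set the variable at the top as a stopgap while you finish copying to gs18 (a sketch; "myproject" is a placeholder for your own directory):

#!/bin/bash
# temporary workaround: point $SCRATCH back at the old ls15 scratch space
export SCRATCH=/mnt/ls15/scratch/users/$USER
cd $SCRATCH/myproject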



The SLURM server will be going down for maintenance from 9:00PM to 11:00PM on Sunday, November 11th. New jobs cannot be submitted during this time and all SLURM commands (sbatch/salloc/srun/squeue etc.) will be unavailable. Running jobs will not be affected by this downtime.

Update: The SLURM server is back online and accepting jobs.

Samba pushed an update that broke mapped connections to our file storage systems. This has been corrected and drives should mount properly now.

Update Nov 1, 8:00 am: Most users will not be able to log in on Thursday, Nov 1 during file system restoration. We will post an update as soon as these systems are available.

On Monday night, the ufs-12-b filer experienced a kernel bug. This caused a high-availability event in which ufs-12-a took over serving the ufs-12-b data store. On 11/1/18 we will need to take both file systems offline to restore normal functionality. We anticipate approximately one hour of downtime for this file storage system.


Update: We have added ufs-13-a and ufs-13-b to this maintenance as well, due to an HA failover last evening.


File systems ufs-13-a and ufs-13-b are currently unavailable due to unexpected maintenance related to the intel16 cluster move. Users with home directories on these file systems will not be able to log in, may not be able to see their files, and may find that researcher folders are unavailable. However, your files are safe and your accounts are fine. An update will be provided when these systems return to service.



Update 9:40am: ufs-13-a and ufs-13-b should now be available.

At 12:15 PM an Emergency Power Off was triggered by the fire system in the Engineering data center. After restoring power, one of the core network switches failed. HPCC staff are in the process of moving connections off of the failed switch and restoring the HPCC environment. The next update will be posted by 6:30 PM. 

UPDATE 6:30 PM: Most of the network connections have been restored. Gateways are currently available as are the ufs-13 file servers. We are working on an issue with the home directory servers ufs-11 and ufs-12. We will continue to return the environment to service; the next update will be 7:30 PM.

UPDATE 7:40 PM: All home directory servers have been returned to service. We are testing the Infiniband fabric and compute nodes and expect that we'll be able to resume scheduling soon. The next update will be at or before 8:30 PM.

UPDATE 8:35 PM: Scheduling has been resumed; most systems should have returned to service. Please let us know if you have any problems.

SLURM Update Successful

The 10/10/2018 SLURM update was a success. SLURM is back online and accepting jobs.

The SLURM scheduler will be offline on Wednesday 10/10/2018 from 9:00PM to 11:00PM for an upgrade. New jobs cannot be submitted during this time and all SLURM commands (squeue, sinfo, etc.) will be unavailable. Currently running jobs will be unaffected by this outage.

The new scratch system at /mnt/gs18 may be temporarily unavailable today for ongoing testing. We will post updates once that issue is resolved.

On the week of October 22nd, we will be moving the Intel14 cluster over to the new data center. No jobs will run on this cluster during this outage. Intel16 and Intel18 will remain available during this transition. The cluster will return to service by the end of the week. Please let us know if you have any concerns.

UPDATE: Please note that the ffs17 file system will not be available on the intel14 cluster from the 22nd until the Intel16 migration is complete during the week of the 29th.

UPDATE 10/25 4:58 PM: The Intel14 cluster has been returned to service. The /mnt/gs18 scratch file system is now available.

On the week of October 29th, we will be moving the Intel16 cluster over to the new data center. No jobs will run on this cluster during this outage. Intel14 and Intel18 will remain available during this transition. The cluster will return to service by the end of the week. Please let us know if you have any concerns. 

All Intel16 nodes have been returned to service.

dev-intel18 will be unavailable from 7:30 to 8:30 AM on October 2nd for a network hardware upgrade. We apologize for any inconvenience. lac-249 will remain accessible from the other dev nodes as a CentOS7 node during the outage.