
Blog

The SLURM server will be going down for maintenance at 8:00PM on Sunday, February 17th. Maintenance will last about 15 minutes. During this time, no new jobs may be submitted to SLURM and client commands (squeue, sbatch, etc.) will be unavailable. Running jobs will not be affected by this downtime.


At 9:30 am we experienced a failover on our ufs-13 storage system due to a possible kernel bug. We will be taking both systems offline in an attempt to correct this issue. We anticipate the downtime will be approximately 2 to 3 hours.


Update 15:30: Maintenance is taking longer than expected. We will provide another update when it is complete.

Update 16:44: Maintenance is complete and all directories should now be available.

We would like to remind users that files on /mnt/scratch with a modification time over 45 days are purged nightly. If you have files that you need to preserve, please consider moving them to your home directory.
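
A quick way to see which of your files are at risk is the find command's mtime test. This is only a sketch: the /mnt/scratch/$USER path and the results_to_keep directory are assumptions, so adjust them to where your data actually lives.

# List files under your scratch directory not modified in more than 45 days.
$ find /mnt/scratch/$USER -type f -mtime +45 -ls

# Copy anything you want to keep back to your home directory before it is purged.
$ rsync -av /mnt/scratch/$USER/results_to_keep/ $HOME/results_to_keep/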


Security certificates for wiki.hpcc.msu.edu and rt.hpcc.msu.edu have expired. We are working on replacing them. Please accept any warnings the sites present until the new certificates are in place.

UPDATE:  The new security certificates are in place now.  You should not receive any warnings.

The SLURM server will be going down this Thursday at 8:00PM until approximately 9:00PM. We are taking this downtime to increase the resources allocated to the SLURM server, which will improve the overall stability of SLURM. Running jobs will not be affected by this downtime; however, no new jobs will start and SLURM client commands (srun, salloc, sbatch) will not work until the server is back online.
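
A quick way to tell when the server is back online is a minimal check with standard SLURM client commands:

# Ask whether the SLURM controller is responding again.
$ scontrol ping

# Once it reports the controller as UP, the usual client commands work again.
$ squeue -u $USER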


As of approximately 4:30pm, both filers ufs-13-a and ufs-13-b have been restored to full functionality. Users who continue to experience login troubles are encouraged to open a ticket at https://contact.icer.msu.edu/contact

At approximately 1:30pm on Thursday, January 31, the home filers ufs-13-a and ufs-13-b experienced a failure which caused the directories to become unavailable. We continue to work on restoring both of these to full functionality, and will follow this announcement with further updates as available.

On Friday, February 1, 2019, the HPCC will be performing memory upgrades to the following gateways:


gateway-00

gateway-02

gateway-03


The memory upgrades will be performed one at a time, and each host should take approximately five minutes. Upgrades will begin at 9:00am, and will be complete on all hosts by 9:30am.


Access to the HPCC will continue to be available during the upgrades; however, some users may notice a slight increase in login times while the hosts are rebooted.

On 02-03-19 our home file system experienced an issue that caused the data store for ufs-13-b to be moved to ufs-13-a. On 02-07-19 at 8am we will be rebooting both units to restore high-availability functionality. This process typically takes about an hour.


Update: We have moved the reboot forward because the performance of our home file system has suffered.


Update 2-7-19 9:40 am: Maintenance is now complete and the file systems are back online.

The Rsync gateway (Globus) will be going down briefly tonight at 8:00PM for a reboot.

We discovered that the policy removing files with an mtime over 45 days has not been working properly. We will be re-enabling it on Monday, February 4th. Please be aware that any files not modified in 45 days will be removed regularly starting on this date.

Minor SLURM Update this Week

HPCC staff will be performing a minor update to the SLURM software starting Tuesday, January 15th, and completing on Thursday, January 17th. SLURM will be available during this time; however, brief (less than a minute) interruptions in the availability of SLURM commands may occur. This update will fix minor bugs encountered by HPCC users. The update process will not impact running jobs.

UPDATE: 1/18 08:50 AM Scratch space ls15 is available on the rsync gateway, dev nodes, and compute nodes. If you have any questions about accessing your scratch space, please let us know.

UPDATE: 1/17 08:50 AM Scratch space ls15 is available on the rsync gateway (rsync.hpcc.msu.edu) and all compute nodes. It is not yet available on all dev nodes. Our system administrators are working on the issue. Please wait for the next update.

UPDATE: 1/10 11:15 AM The physical move and OS update has been completed on ls15. There is an issue with the Lustre software upgrade which has delayed the return to service. gs18 remains available until the issue has been resolved.


The HPCC downtime reservation on Thursday, Jan 3rd is complete.

Scratch space /mnt/ls15 will be taken offline from Monday, Jan 7th to Thursday, Jan 10th for an upgrade. If any work requires scratch space, please use the new scratch system /mnt/gs18 instead. Please check the scratch space transition site for how to move your work to /mnt/gs18.
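
One way to copy a directory of work over ahead of the outage is sketched below; the per-user paths and the my_project name are assumptions, and the transition site above is the authoritative guide.

# Copy a project from the old scratch system to the new one.
$ rsync -av --progress /mnt/ls15/$USER/my_project/ /mnt/gs18/$USER/my_project/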

If any of your submitted jobs may use /mnt/ls15 scratch space between Jan 7th and Jan 10th, please modify the script and resubmit it (see the sketch after the list below). The following jobs have been placed on hold by HPCC administrators due to the ls15 outage:

Jobs on Hold:

5777369,5777370,5777373,5777374,5861807,5862078,5863186,5871676,5871678,5871680,5871683,
5871684,5871685,5871688,5871693,5871695,5871696,5871697,5871698,5871699,5871700,5888175,
5888238,5888387,5889726,5889967,5890135,5897035,5911266,5926822,5929189,5929196,5929244,
5929648,5929652,5929660,5929662
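
If one of your jobs is on this list, the steps below sketch the cleanup; the squeue options are standard SLURM, but the example job ID and my_job_script.sb are placeholders.

# Show your pending jobs and the reason they are pending (hold or reservation).
$ squeue -u $USER -t PENDING -o "%.10i %.9P %.20j %.8T %R"

# After editing the job script to use /mnt/gs18 instead of /mnt/ls15,
# cancel the held job and resubmit the corrected script.
$ scancel 5777369
$ sbatch my_job_script.sb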



HPCC has a downtime reservation for maintenance on Thursday, 01/03/2019. No jobs can run during the reservation window. Any submitted job that would otherwise run during this time will be held until the maintenance is over. Details of the reservation can be found by running the command "scontrol show reservation" on a dev node:

$ scontrol show reservation
ReservationName=root_25 StartTime=2019-01-03T06:00:00 EndTime=2019-01-03T18:00:00 Duration=12:00:00
   Nodes=cne-000,csm-[001-023],csn-[000-039],csp-[001-027],css-[001-127],ifi-[000-004],lac-[000-225,228-261,276-369,371-372,374-445],nvl-[000-006],qml-[000-005],skl-[000-143],vim-[000-002] NodeCnt=811 CoreCnt=23156 Features=(null) PartitionName=(null) Flags=MAINT,IGNORE_JOBS,SPEC_NODES,ALL_NODES
   TRES=cpu=23156
   Users=root Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
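
To see how the reservation affects your own jobs, one option (a sketch using standard SLURM commands on a dev node) is to list your pending jobs along with their estimated start times, which should fall after the reservation ends:

$ squeue -u $USER --start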

The gateways have been updated to CentOS 7 to match the rest of the cluster. When accessing the cluster you may see a warning regarding the host key being incorrect; we are working on resolving the host key issue.

For more information about how to fix this, see our HPCC FAQ: known_hosts error.
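
If you would rather clear the stale entry yourself before we resolve it centrally, one common approach is sketched below; replace the hostname with the gateway named in the warning you see.

# Remove the cached key for the gateway from your ~/.ssh/known_hosts,
# then accept the new CentOS 7 host key on your next login.
$ ssh-keygen -R hpcc.msu.edu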