Blog

Webrdp is currently offline.  We are looking into this and will provide updates when available.


Update 10:00 AM: The webrdp server is now back online.

We are currently experiencing a network issue causing most of our nodes to be offline.  We are investigating and will provide updates as soon as possible.


Update 9:15 AM: ITS has resolved a network issue in the MSU data center. All nodes are now back online.


Update 11:15 AM: A core data center switch failed at 1:04 AM this morning due to a bug in that switch's controller software. As part of a redundant switch pair, a single failure should not have taken the network offline, but the second switch did not successfully take over. We have identified why the second switch was unable to bring up the interface and are working to implement a fix that will prevent this from happening again.

On Saturday, June 20th, at 8:00 PM, all SLURM clients on the HPCC will be updated. This includes components installed on the development and compute nodes. As a consequence of this update, any pending srun or salloc commands run interactively on the development nodes will be interrupted. Jobs queued with sbatch, and srun processes within those jobs, will not be disrupted. Please contact us at https://contact.icer.msu.edu/ if you have any questions.

A recent update of the SLURM scheduler introduced a potential bug when specifying job constraints. Specifying certain constraints may yield a "Requested node configuration not available" error. If you encounter this error when submitting jobs with constraints that worked prior to Wednesday, June 10th, update your constraint to specify the 'NOAUTO' flag, e.g. 'NOAUTO:intel16' instead of 'intel16'. This will circumvent the issue while we work with our software vendor on a permanent fix. Please contact us with any questions at https://contact.icer.msu.edu/.
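
For example, a job that previously requested the intel16 constraint could be resubmitted with the NOAUTO prefix (a sketch only; myjob.sb is a placeholder for your own submission script):

    sbatch --constraint=NOAUTO:intel16 myjob.sb

The same change can be made to an in-script directive, e.g. #SBATCH --constraint=NOAUTO:intel16.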

Update 8:47 PM: The scheduler is back online.

We are currently working with our software vendor to address an issue with our job scheduler. The scheduler is currently paused. SLURM client commands will not be available and new jobs will not start until this issue is resolved.

Today we will begin correcting file and directory ownership on all research spaces.  Please note this process will take up to several weeks to complete.  Before correcting ownership, we will contact any users whose large numbers of files may cause research directories to go over quota.

LS15 Maintenance

On 4/23/20 at 8:00 AM we will be performing maintenance to correct issues that are causing slow performance on our ls15 system.  The system may range from slow to unresponsive during this time.  Maintenance is expected to be completed in less than two hours.

The slowness issue is resolved.

One of the OSS servers on the ls15 (/mnt/ls15) scratch file system is slow to respond to lock requests from the MDS server. We are working on replacing the affected drive at the moment; this will take a while to complete. We will update this announcement once the system is back to normal.

Update 3:05 PM: The issue is resolved.

The ls15 scratch system (/mnt/ls15) is currently experiencing an issue and we are working on it. We will update this information when it is back to normal.

We will be performing emergency maintenance at 7:00 AM on Friday, 4-3-20, on all gateways and development nodes.  This will require home directories to be taken offline on those nodes and the nodes rebooted.  We expect maintenance to be complete by 8:00 AM.

Starting this morning we will be performing a patch upgrade to our home directory system.  The patch has been provided by our vendor to correct issues with quota functionality.  You may see some pauses in the system while components are restarted.


Update 4-1-20: All maintenance is complete.

As part of MSU’s response to COVID-19, ICER is transitioning to an online-only support model. All HPCC services will continue as normal.


We are currently experiencing issues with our network that are causing slow or broken connections to several of our login and transfer systems.  We are looking into this issue and will provide updates as available.


Update 2/12/20 1:20 PM: The issue is now resolved.  We will be monitoring our network for further issues.


Update 4-2-20:  After patching the system, a quota check has run successfully.  We believe the reported quotas are now correct.

Important Note:  We are seeing about 50 research spaces over quota.  This is likely due to previously under-reported quotas.  We have checked these with du and they appear to be reporting properly.  Please remember that if you are storing large numbers of small files, the reported quota will not match du due to file system block size limitations.
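
As a rough illustration of the block size effect, GNU du can report both the sum of file sizes and the space actually allocated on disk; for directories containing many small files, the allocated usage (which is what the quota counts) will be noticeably larger. The path below is a placeholder for your own research space:

    du -sh --apparent-size /mnt/research/mygroup
    du -sh /mnt/research/mygroup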

Note:  We have ensured all default quotas are now enforced on research groups.  If you are having trouble with your research group's space, please open a ticket and we will assist you.


Currently, our home file system's quota check function will sometimes cause a user's directory to have an incorrect quota. If you see this, please open a ticket and we will work with you to temporarily increase your quota. We continue to work with our vendor to correct this issue.


Update 4-1-20:  We have received a patch and are testing to see if all quota issues have been resolved.   

UPDATE (9:52 PM): The maintenance is complete and filesystems are remounted on the gateways.

UPDATE: This outage is now scheduled for February 8th.

On Saturday, February 8th, there will be a partial outage of HPCC storage starting at 8:00 PM. This outage will begin with rolling reboots of all gateways and will interrupt storage access on the login gateways and rsync gateway only. This may cause 'cannot chdir to directory' errors when connecting to the HPCC. Users can continue to connect from the gateways to development nodes to conduct their work. Development nodes and running jobs will not be affected. This outage is expected to last several hours.