Blog

Today we will begin correcting file and directory ownership on all research spaces.  Please note this process will take up to several weeks to complete.  Before correcting ownership, we will contact any users with large numbers of files that may cause research directories to go over quota.

LS15 Maintenance

On 4/23/20 at 8am we will perform maintenance to correct issues that are causing slow performance on our ls15 system.  The system will be slow or unresponsive during this time.  Maintenance is expected to be completed in less than two hours.

      The slowness issue is resolved.

     One of the OSS servers on the ls15 (/mnt/ls15) scratch file system is slow to respond to lock requests from the MDS server. We are currently working on replacing the affected drive. This will take a while to complete. We will update this announcement once the system is back to normal.

       Update at 3:05pm: the issue is resolved.

       The ls15 scratch system (/mnt/ls15) is currently having an issue and we are working on it. We will update this information when it is back to normal.

We will be performing emergency maintenance at 7am on Friday 4-3-20 on all gateways and development nodes.  This will require home directories to be taken offline on those nodes and the nodes rebooted.  We expect maintenance to be complete by 8am.

Starting this morning we will be performing a patch upgrade to our home directory system.  This has been provided by our vendor to correct issues with quota functionality.  You may see some pauses in the system while components are restarted.  


Update 4-1-20 All maintenance is complete.

As part of MSU’s response to COVID-19, ICER is transitioning to an online-only support model. All HPCC services will continue as normal.


We are currently experiencing network issues that are causing slow or broken connections to several of our login and transfer systems.  We are looking into this issue and will provide updates as available.


Update 2/12/20 1:20pm  The issue is currently resolved.  We will be monitoring our network for further issues.


Update 4-2-20:  After patching the system, a quota check has run successfully.  We currently believe the reported quotas are correct.

Important Note:  We are seeing about 50 research spaces over quota.  This is likely due to previously underreported quotas.  We have checked these with the DU functionality and they appear to be reporting properly.  Please remember that if you are storing large numbers of small files, the reported quota will not match DU due to system block size limitations.
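As a rough illustration of why many small files can make the two numbers diverge, the sketch below compares the sum of file sizes with the space used when storage is allocated in whole blocks. The 64 KiB block size and the 1,000 one-KiB files are hypothetical values chosen for the example, not the settings or data of our file system.

```python
def allocated_bytes(file_size, block_size):
    """Space a file occupies if storage is allocated in whole blocks."""
    # Integer ceiling division: round the size up to a multiple of block_size.
    return -(-file_size // block_size) * block_size

BLOCK_SIZE = 64 * 1024        # hypothetical block size, for illustration only
file_sizes = [1024] * 1000    # hypothetical workload: 1,000 files of 1 KiB

apparent = sum(file_sizes)
allocated = sum(allocated_bytes(size, BLOCK_SIZE) for size in file_sizes)

print(f"sum of file sizes:   {apparent} bytes")
print(f"allocated in blocks: {allocated} bytes")
```

Under these assumptions each small file consumes a full block, so the allocated total is far larger than the sum of the file sizes.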

Note:  We have ensured all default quotas are now enforced on research groups.  If you are having trouble with your research group please open a ticket and we will assist you.


      Currently the check quota function on our home file system will sometimes cause a user's directory to have an incorrect quota. If you see this, please open a ticket and we will work with you to temporarily increase your quota. We continue to work with our vendor to correct this issue.


Update 4-1-20:  We have received a patch and are testing to see if all quota issues have been resolved.   

UPDATE (9:52 PM): The maintenance is complete and filesystems are remounted on the gateways

UPDATE: This outage is now scheduled for February 8th

On Saturday, February 8th, there will be a partial outage of HPCC storage starting at 8:00PM. This outage will begin with rolling reboots of all gateways and will interrupt storage access on the login gateways and rsync gateway only. This may cause 'cannot chdir to directory' errors when connecting to the HPCC. Users can continue to connect from gateways to development nodes to conduct their work. Development nodes and running jobs will not be affected. This outage is expected to last several hours.

HPCC login issues 12-24-19

As of 10am today our gateways are no longer able to communicate with our home storage system.  We are looking into the issue and will rectify it as soon as possible.  Compute nodes can mount the home system properly and jobs will continue to run properly.



Starting at 8am on 12-12-19 our vendor is performing updates to the software on our home system.  We do not anticipate any downtime associated with the upgrades.


Update 12-13-19 1pm: Updates have been completed on 4 of 6 storage nodes.  We anticipate the remaining storage nodes to be complete by end of day.  Updates on protocol nodes will continue on Monday 12-16.  When all existing equipment is upgraded we will be adding an additional storage block.

Update 12-13-19 4pm: On Monday 12-16-19 users will see periodic Samba home directory mount outages.

Update 12-17-19 8am:  Upgrade work on our existing home storage system is complete.  We will be adding additional storage to the system on 12-17 and 12-18.  During our 12-19 outage all compute node clients will receive software updates to match the storage cluster.  During the outage we will also be replacing our AFM backup nodes with new hardware for better backup and overall system stability.

The HPCC is undergoing an upgrade of the GS18 scratch. No service interruptions are expected.


2019-21-12: All upgrades on the scratch cluster are now complete.

Today at 3:00 dev-intel16-K80 will go down for maintenance.  The available GPUs are currently not correct, and a replacement card will resolve this issue.  We will return the system to service as soon as possible.

UPDATE:  Dev-intel16-K80 is working and available now.

We have a new AMD-based Rome server with 128 cores and 512 GB of memory available to users. It is currently accessible as eval-epyc19 via ssh from any development node. We are considering this architecture for our 2020 cluster purchase and would like your feedback on any strengths or weaknesses you notice.

We've configured 4 NUMA clusters per socket (16 cores each). In early testing, a hybrid OpenMP-MPI model that uses OpenMP threads within each NUMA domain or L3 cache and MPI between processes provides excellent performance. You can see the system layout with

lstopo-no-graphics
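To make the hybrid layout described above concrete, here is a minimal sketch that works out one MPI rank per NUMA domain with 16 OpenMP threads each. The two-socket count follows from 128 cores at 4 NUMA clusters of 16 cores per socket. The contiguous core numbering is an assumption made for illustration; check the real numbering in the lstopo-no-graphics output before pinning anything.

```python
def hybrid_layout(sockets=2, numa_per_socket=4, cores_per_numa=16):
    """Return one (rank, first_core, last_core) tuple per NUMA domain.

    Assumes cores are numbered contiguously within each NUMA domain,
    which may not match the real node.
    """
    layout = []
    for rank in range(sockets * numa_per_socket):
        first = rank * cores_per_numa
        layout.append((rank, first, first + cores_per_numa - 1))
    return layout

layout = hybrid_layout()
print(f"MPI ranks: {len(layout)}, OpenMP threads per rank: 16")
for rank, first, last in layout:
    print(f"rank {rank}: cores {first}-{last}")
```

With these numbers the sketch yields 8 MPI ranks covering all 128 cores. Using one rank per L3 cache instead would need the cores-per-L3 figure from the topology output.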

This node is a shared resource so please be considerate of other users and aware of the impact other users may have on your benchmarking. If you would like exclusive access, please contact us so we can coordinate that.