We have a new AMD-based Rome server with 128 cores and 512 GB of memory available to users. It is currently accessible as eval-epyc19 via ssh from any development node. We are considering this architecture for our 2020 cluster purchase and would like your feedback on any strengths or weaknesses you notice.
We've configured 4 NUMA domains per socket (16 cores each). In early testing, a hybrid MPI-OpenMP model that uses OpenMP threads within each NUMA domain or L3 cache and MPI between processes provides excellent performance. You can see the system layout with
This node is a shared resource, so please be considerate of other users and be aware of the impact they may have on your benchmarking. If you would like exclusive access, please contact us so we can coordinate that.
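As a rough illustration of the hybrid layout described above, the sketch below works out how many MPI ranks and OpenMP threads per rank would fill the node if one rank is placed per NUMA domain or per L3 cache. The helper `hybrid_layout` and the `HybridLayout` class are hypothetical names introduced here. The figure of 4 cores per L3 cache is an assumption based on the usual Rome core-complex layout, not something we have stated for this node, so please check it against the actual system layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HybridLayout:
    """One way to split a node between MPI ranks and OpenMP threads."""
    mpi_ranks: int
    omp_threads_per_rank: int


def hybrid_layout(total_cores: int, cores_per_domain: int) -> HybridLayout:
    """Place one MPI rank per domain and one OpenMP thread per core in it.

    Hypothetical helper for illustration only; it does no pinning or launching.
    """
    if cores_per_domain <= 0 or total_cores % cores_per_domain != 0:
        raise ValueError("cores_per_domain must evenly divide total_cores")
    return HybridLayout(
        mpi_ranks=total_cores // cores_per_domain,
        omp_threads_per_rank=cores_per_domain,
    )


TOTAL_CORES = 128            # from the announcement
CORES_PER_NUMA_DOMAIN = 16   # from the announcement (4 domains per socket)
CORES_PER_L3 = 4             # ASSUMPTION: typical for Rome, verify on the node

per_numa = hybrid_layout(TOTAL_CORES, CORES_PER_NUMA_DOMAIN)
per_l3 = hybrid_layout(TOTAL_CORES, CORES_PER_L3)

print(f"per NUMA domain: {per_numa.mpi_ranks} ranks x "
      f"{per_numa.omp_threads_per_rank} threads")
print(f"per L3 cache:    {per_l3.mpi_ranks} ranks x "
      f"{per_l3.omp_threads_per_rank} threads")
```

The threads-per-rank value is what you would normally export as `OMP_NUM_THREADS` before launching. How ranks and threads are bound to cores depends on your MPI launcher and is not covered here.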
From the week of December 9th through early January, we will be performing software upgrades on our gs18 and ufs18 storage systems, which will improve performance and reliability. During this time, users may experience periodic pauses and degraded performance. We will update this blog post if there are any specific impacts users may see and as the work proceeds. Please contact us if you have any concerns.
Update: 3:25 PM 12/20: The upgrade on 12/19 introduced two new bugs. Users may experience "Stale File Handle" messages, slow home directory or research space access, or be unable to log into a gateway or dev node while this problem is occurring. The vendor is preparing fixes for us to deploy today or tomorrow, and we understand what triggers the problem and have a workaround to reduce the impact on our users. We're sorry for any impact that this has on your research.
On Thursday, December 19th, the HPCC will undergo scheduled maintenance. We will be applying GPFS software updates, adding redundancy for the InfiniBand fabric, and making additional minor fixes and updates. We will be rebooting every compute node, so any jobs that would overlap with the maintenance window will be held until after the outage. The entire system may be unavailable during this work. We will update this blog post as more information becomes available.
Reminder to users: please be vigilant about data kept on scratch. Temporary data on scratch should be cleaned up regularly and kept current to help prevent the file system from becoming full.
Please see this blog post about additional file system work happening this December.
UPDATE: 6:20 PM The system maintenance has been completed.
Update (11:15 PM): The home file system is back online.
The HPCC is currently experiencing an unexpected outage of the home file system. We are working to resolve the issue.
After investigation, we have found that disk space quotas have not been enforced properly. We will be correcting this on 11-21-19. We encourage users to check their disk usage against their quota and ensure that their research space is not over quota. Based on current usage, about 30% of research spaces will be over quota. We will also be contacting the PI of each over-quota space directly.