Please let us know if you need immediate assistance by filing a ticket with us at https://contact.icer.msu.edu. You can also reach us at (517) 353-9309, or stop by the iCER office.
Institute for Cyber-Enabled Research
Michigan State University
Biomedical & Physical Sciences Bldg
567 Wilson Road, Room 1440
East Lansing, MI 48824
Phone: (517) 353-9309
We've had a number of issues with our Lustre file system (/mnt/ls15 or /mnt/scratch) over the past few months. We've put together this blog post to summarize them. If you are currently experiencing problems that you suspect are related to the scratch file system, please open a ticket with us.
- Metadata server full. On November 21st, the 270 million file capacity of Lustre was reached. Due to a bug in Lustre with ZFS, it is not possible to delete files on a full file system. We were able to bring additional capacity online to delete files and return to service that afternoon. We have implemented a quota, as previously announced in August. We are also currently purging any files older than 45 days as previously announced.
- Quota: Your current quota on scratch/ls15 is the number of files you have on scratch + 1 million. The previously announced quota of 1 million files will be phased in over the next few weeks; affected users will be notified by email before enforcement happens. We will also begin phasing in a 50 TB quota on scratch/ls15. Home directories and research spaces are unaffected.
- Misreporting quota. About 1/8th of our users have incorrect quota reporting due to a Lustre bug. Please contact us if you get a Disk Quota Exceeded error or are unable to access scratch. You can check your quota with the command:
from a development node. If you see that you have 16 EB of data or 16x10^18 files in use, or see an * in the output, please contact us.
- Metadata performance: Over the past few months we have had multiple incidents where users submitted many hundreds of jobs that generate and delete tens of thousands of files per minute. These operations are very expensive; Lustre is not optimized for small file IO. The file system can become unresponsive to other users while this is happening. Files on scratch should be at least 1 MB each; for optimum performance, at least 1 GB per file. Status: We continue to work with users to educate and identify problematic workloads.
- Intel14 fabric links. There were a few missing links on the Intel14 fabric that were degrading performance. They have been replaced.
- Intel14 firmware update. We are tracking a communication error on the Intel14 cluster. The Intel14 cluster is running an older version of the InfiniBand network firmware. We are in the process of updating these nodes.
- No space left on device or Input/Output error. We believe that these are caused by communication issues on the network fabric. Early testing has shown improvements with the firmware mentioned in the previous bullet point. If you have these errors on scratch please contact us.
- Missing files? Some users have reported emails from email@example.com with a subject line such as "PBS JOB 31234567.mgr04" that contain messages like:
These can indicate a problem with scratch or with TORQUE, Moab, the home directory system, or the nodes your job was running on. Please examine your job script and program outputs for more specific error messages and let us know if you see this issue.
- 2TB file size limit: If you are attempting to write more than 2 TB to a single file, you will need to adjust the stripe setting of the file or the directory so that it is spread across multiple server targets, using the lfs setstripe command. Please contact us for more information.
- Problem with dev-intel14. Users have reported IO errors on dev-intel14 specifically. Please report the problem and try another development node if you continue to have an issue.
- Intel16: Lustre performance. Until our August maintenance window, Lustre was running over the Ethernet interfaces on the Intel16 cluster. Single node Lustre performance suffered. This has been resolved.
- Intel16: Fabric Unbalanced. Due to a physical limitation, the Lustre storage servers on the intel16 network were not evenly balanced throughout the fabric. At times of high traffic the Lustre servers would lose communication with the intel16 cluster. The fabric has been rebalanced and we are no longer seeing traffic contention on that switch.
- Failover configuration after update. During our August maintenance window, we identified a bug in the Lustre software that prevented high availability features from functioning. This was patched in September.
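For the metadata performance item above, one common way to avoid creating many small files on scratch is to write them to node-local temporary storage and then store them on scratch as a single archive. The following is only a minimal sketch of that idea, not an HPCC-provided tool; the function name `bundle_small_files` and the directory layout are hypothetical, and it uses only Python's standard `tarfile` and `pathlib` modules.

```python
import tarfile
from pathlib import Path


def bundle_small_files(src_dir: Path, archive_path: Path) -> int:
    """Pack every regular file under src_dir into one uncompressed tar archive.

    Returns the number of files added. archive_path should live outside
    src_dir so the archive does not try to include itself.
    """
    count = 0
    with tarfile.open(archive_path, "w") as tar:
        for path in sorted(src_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir so the archive unpacks cleanly.
                tar.add(path, arcname=path.relative_to(src_dir).as_posix())
                count += 1
    return count
```

A job could, for example, write its per-task outputs to a temporary directory on the compute node and call `bundle_small_files(tmp_dir, scratch_dir / "results.tar")` once at the end, so that scratch sees one large file instead of thousands of small ones.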
Update: Maintenance complete and all file services are restored. We are monitoring file systems carefully. Please contact us with any concerns.
Update: Systems will remain off-line for an additional hour from the original notice. We are planning to return to service at 4:30 pm today. We are sorry for the inconvenience and thank you for your patience.
File systems are currently off-line and users are unable to log in, connect for drive mapping/file sharing, remote desktop, or file transfer, or use Globus services, per our previous announcement at https://wiki.hpcc.msu.edu/x/yAKpAQ
Update 4:00pm : SCRATCH service has been restored.
The Lustre/SCRATCH system is currently unavailable while we mitigate a problem with the metadata server, which has filled unexpectedly. An unfortunate symptom is that users are not able to read, write, or delete files at this time. We are sorry for the interruption and will update this when the problem is solved. Thank you for your patience.
UPDATE : The maintenance has been extended until 4:30pm EST, 11/22/2016
NOTICE: All file systems will be closed during this outage and users will not be able to log in during this time.
The HPCC will be conducting system maintenance on User File Servers next Tuesday, November 22, 2016, at 12:30pm until 3:30pm (all times Eastern).
This maintenance window is needed to address a system stability issue with these filers.
User jobs that can complete before the maintenance window will continue to run. User jobs that will cross into the maintenance window will be held until after the maintenance window. Job scheduling will return to normal after the maintenance window.
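The rule described above can be illustrated with a small calculation. This is only a sketch under the assumption that the scheduler compares a job's requested walltime against the start of the maintenance window; the actual scheduler logic may differ. The function name `job_can_start` is hypothetical, and the window times are taken from the announcement (treated as Eastern time).

```python
from datetime import datetime, timedelta

# Maintenance window as announced: November 22, 2016, 12:30pm to 3:30pm Eastern.
WINDOW_START = datetime(2016, 11, 22, 12, 30)
WINDOW_END = datetime(2016, 11, 22, 15, 30)


def job_can_start(now: datetime, walltime: timedelta,
                  window_start: datetime, window_end: datetime) -> bool:
    """Return True if a job started at `now` would not overlap the window."""
    if now >= window_end:
        # The window is over; scheduling is back to normal.
        return True
    # Otherwise the job must finish no later than the window start.
    return now + walltime <= window_start


# A 4-hour job started at 8:00am finishes at noon, so it can run.
print(job_can_start(datetime(2016, 11, 22, 8, 0), timedelta(hours=4),
                    WINDOW_START, WINDOW_END))
# A 5-hour job started at 8:00am would cross into the window, so it is held.
print(job_can_start(datetime(2016, 11, 22, 8, 0), timedelta(hours=5),
                    WINDOW_START, WINDOW_END))
```

In practice this means that requesting a shorter, accurate walltime can let a job run before the window rather than being held until afterwards.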
Users will not have access to the system during this maintenance window. This restriction is unfortunately needed to ensure stability during the maintenance.
Please contact us at https://contact.icer.msu.edu/contact if you have any questions or concerns.
A failed hardware component on ufs-12-a took it offline at 10:45 PM on 11/14/2015. At about 1 AM on 11/15, ufs-12-b attempted to take over but shut itself down due to a perceived hardware issue. Users and research spaces on ufs-12-a and ufs-12-b will be unable to access the system until access is restored.
UPDATE 16 NOV 09:06am Access to filesystems hosted on ufs-12-a & b has been restored.
All users have now been migrated off the ufs-10-a and ufs-10-b filers. Due to their age and limited resources, these filers caused a large amount of instability and at times made the cluster unresponsive. With the addition of the new filers we have doubled the resources available. This will increase the stability of our home directory storage.
Migrations of users on ufs-09-a will begin when our next set of new storage is available.
We recently applied a fix to address an urgent Linux kernel security issue. Users have reported that some MPI processes may have been terminated when the patch was applied. New MPI jobs appear to launch successfully. gdb and other debuggers will be disabled until the permanent security fix is applied next week.
From 1:40 to 2 PM today (10-21) access to home directories and research spaces on ufs-11-a was interrupted due to an administrator error. Users may have seen the following error message:
"NFS: Stale File Handle"
Jobs attempting to access ufs-11-a may have failed. We're sorry for any problems that this caused.
File system maintenance is on-going, and the system is unavailable today for many users who had either home directories or shared research directories on the 10-b file system (approximately 700 users affected).
Those affected cannot log in. See https://wiki.hpcc.msu.edu/x/VQKpAQ for details. Affected users were sent two messages indicating they would not be able to log in today.
Update 1pm: The first 200 users have been migrated and their accounts have been unblocked and released. There are another ~500 users still to migrate, which we expect to be completed this evening.
Update 7:35 PM: All accounts that we were able to migrate have been migrated. All users should have access to their new home directories within two hours. Emails will be sent tomorrow with instructions for updating Samba mounts if you are using them.
One of our home directory servers (ufs-10-b) has had a number of problems in the last few weeks. These problems have caused the HPCC to become unavailable and jobs to fail. To resolve this, we are moving all active users on ufs-10-b to new home directory servers.
- No submitted jobs for ufs-10-b users that would run during the maintenance window will start; they will start after the maintenance is complete. ufs-10-b users can continue to run jobs that will end before the maintenance.
- ufs-10-b users will not be able to use SSH, Globus, RDP, or rsync to access the HPCC on the 19th.
- Windows file sharing (Samba/smb) access to ufs-10-b will be unavailable on the 19th. You will need to update your client's mount to point at the new file server after the maintenance is complete.
- Any access to research spaces that are on ufs-10-b will be disabled until the migration is complete.
To check a research space (replace myresearchspace with the name of the research space)
If either returns 'ufs-10-b.i' you will be affected.
The HPCC will be intermittently available from 8 AM to 5 PM Tuesday, October 4th. No scheduled jobs will be run during the window.
- Cluster expansion: An additional intel16/Laconia rack will be installed. This will add an additional 2000 cores to Laconia.
- Lustre (/mnt/scratch) performance and stability improvements: A bugfix release will be applied to the Lustre servers and the network configuration will be adjusted.
- Home directory migration: Active users on ufs-10-a will be migrated to our new file servers. This will significantly improve performance and user experience.
We anticipate that Gateway and development nodes will be available throughout the maintenance with one to two interruptions for reboots. Users on ufs-10-a may have their access blocked until the migration is complete. Lustre will be unavailable until the Lustre-specific maintenance is complete.
If you have any questions please contact us: https://contact.icer.msu.edu/contact
Update: Tuesday, October 4th, 3:12pm EDT:
All of the planned work for the Storage and Cluster Maintenance Outage has been completed and we expect to resume running jobs around 4:00pm EDT today.
Update: 5:15 PM The scheduler was resumed and the maintenance was completed at 4:30 PM.
We are currently experiencing issues with scratch on the Globus and rsync gateway nodes. Until further notice, please transfer any data to a research space, then transfer this data to scratch from a dev node. We understand this is a major inconvenience, and are working to restore scratch functionality on the Globus and rsync gateway nodes as quickly as possible. During this time Globus and file transfer services through rsync.hpcc.msu.edu will have interruptions.
Update Sept 19 4:15pm: Connections to scratch from the rsync/file transfer gateway and Globus have been restored. However, issues remain, and while we work on a more stable solution there may be intermittent problems. Thanks for your patience.
There is a bug that is preventing high availability from working properly on Lustre (/mnt/scratch). Users may notice pauses when accessing /mnt/scratch when one of the back-end Lustre servers becomes unresponsive. Pending I/O should wait until the node is repaired, but we have received reports of failures in which the recovery was unable to process the pending client requests and clients received a write error. We are working with the vendor and Intel to address this issue.
File servers ufs-11-a and ufs-11-b went offline at 10:00 AM due to a high availability software configuration issue. We are working to restore service by 10:30 AM.
UPDATE: Service was intermittently available between 10:30 and 11 AM and was restored by 11 AM. We apologize for any problems this may have caused.