
Announcement History

Last changed Feb 06, 2016 10:51 by Sharan Kalwani

HPCC is up and running. 

If you see an issue, please open a ticket at https://contact.icer.msu.edu/contact

 

Posted at Feb 01, 2016 by Sharan Kalwani | 0 comments
Last changed Jan 15, 2016 18:07 by Andrew R Keen
Labels: maintenance

The HPCC is currently experiencing login issues.  We’re addressing the issue and will post more information as it’s available.

UPDATE: 1/15/2016 6 PM.

Yesterday, one of the servers that provide identity and authorization data (LDAP and DNS) experienced a hardware failure. While we normally have enough capacity to easily support the full cluster on a single node, we were experiencing an abnormally high number of requests from a user with a misconfigured array job, which saturated the remaining server.

We have quadrupled the hardware resources on the primary identity server and replaced the secondary identity server to ensure there is sufficient capacity to keep the cluster available. We have also disabled the users responsible and will work with them to address their issues.

We are working to replace the LDAP and DNS servers with a new architecture that will be more resilient.

Most running jobs should have continued to run, but jobs starting or ending may have experienced failures; affected users should have received an email about the problem. Please contact us if you experienced these failures or if there is anything we can do to help.
Posted at Jan 14, 2016 by Jim Leikert | 0 comments
Last changed Dec 01, 2015 14:03 by Anthony Parker

We are currently experiencing an issue with the ls15/scratch filesystem. As a precautionary measure the Job Scheduler has been temporarily paused to prevent new work from starting until the ls15/scratch volume is stable. We are working on this issue and hope to have it resolved as soon as possible.

 

UPDATE 12/1:  The ls15/scratch filesystem is now operating normally.  The job scheduler is running normally. 

Posted at Nov 30, 2015 by Anthony Parker | 0 comments
Last changed Nov 26, 2015 00:36 by Andrew R Keen
Labels: news, maintenance

An emergency power maintenance outage in the Engineering Building will require an iCER data center shutdown on the afternoon and evening of Wednesday, November 25.  All iCER services will be unavailable during this outage and jobs will not run. Most web-based resources for iCER will be unavailable as well. Services will be restored late Wednesday night. A scheduler reservation has been put in place to prevent new jobs from starting if they would run during the maintenance window.

We regret that this outage means running jobs that were started before we were notified will be killed.  We were only informed about the outage late Friday afternoon and are taking steps to help mitigate the effects on our user community. If you have questions about this power outage, please contact Todd Wilson, IPF project representative, at (517) 927-4612 or tdwilson@ipf.msu.edu, or Leisa Williams-Swedberg, IPF construction superintendent, at (517) 230-5613, or IPF Dispatch at (517) 353-1760. If you have questions about your jobs on iCER resources, please open a ticket with us at https://contact.icer.msu.edu, call us at 517-353-9309, or simply stop by the main iCER offices.  Our office hours are Monday afternoons from 1-2 PM. We will try to assist you in any way possible.

We sincerely regret the lack of advance notice for this outage. Please let us know if you or your jobs will be adversely affected by the outage. We will do our best to help out.

 

UPDATE: 9:30 PM Gateway access has been restored and all file servers are operational. Job scheduling will resume shortly.

UPDATE: 9:56 PM Job scheduling has resumed.

UPDATE: 10:30 PM The control panel on the facility's chilled water system has failed due to the power outage. IPF is on site working to repair. Until the repairs are complete, scheduling is disabled. The gfx10 cluster has been powered off to reduce cooling load.

UPDATE: 12 AM The cooling system has returned to normal. We've resumed scheduling. Please contact us via https://contact.icer.msu.edu with any issues.

Posted at Nov 21, 2015 by Andrew R Keen | 0 comments
Last changed Oct 28, 2015 12:09 by Andrew R Keen
Labels: maintenance

New Scratch File System

The High Performance Computing Center has scheduled a maintenance window on Wednesday November 4 from 7 AM to 6 PM to migrate scratch data from lustre_scratch_2012 to our new 2 petabyte scratch filesystem (/mnt/ls15), which will be much faster and more reliable than the current system. No jobs will run during this migration, but home directories and research spaces will remain available, as will gateway and development nodes. In addition, directory names and paths are changing; scratch users will need to update their scripts.
A new directory structure will be implemented for the new scratch space, and it will not be possible to create files or directories outside of this structure.
  • Personal scratch space directories will be available at /mnt/ls15/scratch/users/<username>, or via the $SCRATCH environment variable. 
  • Research group scratch directories will be available at /mnt/ls15/scratch/groups/<groupname>; jobs that reference the old paths will fail. To request a /mnt/ls15/scratch/groups subdirectory, please open a ticket at https://contact.icer.msu.edu/ 
Permissions and ownership for user and group directories will be identical to the default ownership and permissions on home directories and research spaces, respectively.
If you use the old file paths in any of your job scripts, PLEASE update them to use the $SCRATCH environment variable instead. Any empty directories that are older than 45 days will also be removed. If there are directories that your jobs need in scratch, please have your job script create them if they do not already exist, as in the sketch below.
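
For example, a job script prologue along the following lines (a minimal sketch; the subdirectory names are purely illustrative) creates a working directory under $SCRATCH if it is missing and moves into it before any work starts:

# create the per-job working directory under the new scratch space if it does not exist
mkdir -p ${SCRATCH}/my_project/run1
# run from that directory so output lands on scratch
cd ${SCRATCH}/my_project/run1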

If you have any questions about this migration, please open a ticket at https://contact.icer.msu.edu/ .
Access to Globus Online and the development nodes may be interrupted during the outage. Any jobs left in the queue that reference /mnt/scratch, /mnt/ls12, or /mnt/lustre_scratch_2012 will be held for one week and then deleted. 
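
If you are not sure whether any of your job scripts still reference the old paths, a search along these lines (a sketch; ~/job_scripts is a placeholder for wherever you keep your job scripts) will list the files that need updating:

# list job scripts that still mention any of the retired scratch paths
grep -rl -e '/mnt/scratch' -e '/mnt/ls12' -e '/mnt/lustre_scratch_2012' ~/job_scripts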

Data Migration Details:

  • Top-level directories in /mnt/lustre_scratch_2012/ whose names exactly match a username and are owned by that user (e.g. /mnt/lustre_scratch_2012/sparty) will have their contents transferred to /mnt/ls15/scratch/users/<username>
  • Top-level directories in /mnt/lustre_scratch_2012/ whose names exactly match a research space group name (e.g. /mnt/lustre_scratch_2012/redcedarlab) will have their contents transferred to /mnt/ls15/scratch/groups/<groupname>
  • Any top-level files or directories in /mnt/lustre_scratch_2012/ that do not match the name of a research space group or a username will be moved to the owning user's personal scratch directory (/mnt/ls15/scratch/users/<username>).

If you have directories on scratch that have different migration requirements, please open a ticket at https://contact.icer.msu.edu/ .

FAQ

But why is the HPCC making these changes? This seems like it's making it harder to use!

There are over 2,800 directories in the top level of our current scratch space, which causes performance and workflow issues. This change gives users a more structured environment for their data, improving tracking and top-level performance.

My Scratch directory structure is missing! How do I recreate it?

If our purge has deleted an empty directory structure you need, it's easy to recreate. Use the following to create it:
mkdir -p ${SCRATCH}/my/directory/structure
replacing "my/directory/structure" with the scratch directory paths that you need.
Posted at Oct 27, 2015 by Andrew R Keen | 0 comments
Labels: maintenance

One of the home directory file servers (ufs-09-a) experienced a problem with high load from 6 AM to 9 AM today that may have prevented users from logging into gateway. From 8 AM to 11:30 AM, the scheduler was paused while diagnostic and recovery work was underway. The scheduler was resumed at 11:30 AM and the system is operating normally.

Posted at Oct 17, 2015 by Andrew R Keen | 0 comments

The HPCC will be applying security updates to our services this afternoon, Monday October 5th, 2015.  Users may experience slight delays as these updates are applied.

 

Posted at Oct 05, 2015 by Jim Leikert | 0 comments
Last changed Sep 10, 2015 08:39 by Matthew Bryan Scholz
Labels: maintenance

2015-08-05: The new licenses have been installed and tested. If you see additional problems, contact us:

https://contact.icer.msu.edu

 

 

2015-08-04:  Due to unexpected delays in processing the Ansys license renewal, our Ansys license is temporarily expired.  The new license has been provided to us, and we are currently working on restoring access. Jobs requiring Ansys will fail until the new licenses are installed.

We will update this page with more information when available.  

Posted at Aug 04, 2015 by Matthew Bryan Scholz | 0 comments
Last changed Sep 10, 2015 08:39 by Matthew Bryan Scholz
Labels: maintenance

All HPCC systems will be down on August 11th, 2015 from 7 AM to 5 PM for system upgrades that will improve performance and stability.

  • We will be installing our new high performance scratch file system, greatly increasing performance, capacity, and reliability. 
  • We'll be upgrading the operating systems on our core network switches and firewall and updating the network layout to improve performance and prepare for the new 2016 cluster.
  • We'll be updating the kernel and some support libraries on the clusters to address some stability and performance issues. 


Please note: jobs whose requested walltime would overlap with this outage will not start until after the outage is complete.  For example, week-long jobs will not start after August 4 until the outage is over.
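
As an illustration (a sketch only; the directives shown are standard TORQUE #PBS options, and the script contents are hypothetical), the requested walltime is what determines whether a queued job can fit before the outage:

#!/bin/bash
# One week of walltime requested: submitted after August 4, this job cannot
# finish before the August 11 outage, so it will be held by the reservation
# and will start only after the maintenance is complete.
#PBS -l walltime=168:00:00
#PBS -l nodes=1:ppn=1
cd ${PBS_O_WORKDIR}
./long_running_analysis   # hypothetical executable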

More information about these changes and their potential impact on users will be posted on the HPCC wiki prior to the outage.

UPDATE (5:15pm 08/11/2015): The maintenance has been completed, and we have opened up the gateway and development nodes for access. However, due to unforeseen issues encountered during the outage, scheduling of jobs will be intermittent or disabled until 10:00am ET tomorrow morning at the latest. Thank you for your patience.

UPDATE (10:15am 08/12/2015): Scheduling has resumed and we're monitoring the system for any discrepancies as load increases.  Please let us know if you have any questions or issues.

Posted at Aug 03, 2015 by Matthew Bryan Scholz | 0 comments
Last changed Aug 24, 2015 11:41 by Camille Alva Archer
Labels: msucoding, announcement

MSU Coding Group
Facilitator: Charles Ofria

Location: Mondays, 3-4pm, iCER/BEACON Seminar room, 1455A Biomedical and Physical Science (BPS)

The MSU Coding Group is aimed at people who are generally comfortable with programming and want to discuss software development techniques, new programming languages, and algorithm design, or to help members with coding issues (finding bugs, code reviews, etc.).  Each meeting will start with one or two informal talks on relevant topics (capped at one hour combined), followed by discussion and informal coding for those who want to stick around.

If you would like to continue receiving weekly e-mails about the meetings, please sign up for our new Google Group here:
Below is the schedule for the next several weeks. Please note that August 10 will be the last meeting until the Fall semester; we'll start up again in September.

August 3:  Building, Testing and Sharing a C++ library, Luiz Irber

August 10: Introduction to the Scala Programming Language, James Daley


Posted at Aug 03, 2015 by Camille Alva Archer | 0 comments
Last changed Jul 30, 2015 17:17 by Anthony Parker
Labels: announcement, maintenance

UPDATE 5:16pm  - We have successfully mitigated the high-load and the HPCC systems are running normally.  The Job Scheduler has been resumed. 

--

Thurs July 30, 3:45pm  – The HPCC is currently seeing excessively-high load on some of the home directory filers at this time.  This is causing very slow responses for user commands and logins.  We are working to mitigate the issue as quickly as possible.  The Job Scheduler has been temporarily paused during this time.

Posted at Jul 30, 2015 by Matthew Bryan Scholz | 0 comments
Last changed Jul 27, 2015 12:54 by Anthony Parker
Labels: announcement

RESOLVED 12:53pm – The excessive load on the filer was mitigated successfully.  Systems are operating normally at this time.  The Scheduler has been resumed. Please let us know if you experience any difficulties. 

--

UPDATE  12:07pm  – We are continuing to isolate the primary cause of the excessive load on one of the home directory filers.  Commands and logins may still be slow, and Windows-Samba (SMB) connections may be difficult as the filer recovers.  The Job Scheduler remains paused at this time. 

--

Mon Jul 27  11:43am  – The HPCC is currently seeing excessively-high load on some of the home directory filers at this time.  This is causing very slow responses for user commands and logins.  We are working to mitigate the issue as quickly as possible.  The Job Scheduler has been temporarily paused during this time. 

 

 

Posted at Jul 27, 2015 by Anthony Parker | 0 comments
Last changed Sep 10, 2015 08:38 by Matthew Bryan Scholz
Labels: maintenance

2015-09-10: After the outage last month, several changes were put in place to improve the scheduler stability.  While it is still possible for users to overutilize the home/research directories and cause these issues, we have seen a marked decrease in this problem.  Thank you all for your cooperation.


We are currently experiencing an issue with the communication between the two components of our scheduling system: Moab and TORQUE.  Due to issues with our high-speed scratch storage server, individual compute nodes can enter a state that causes Moab (the scheduler) to become non-responsive.  This can cascade, causing Moab to stop responding properly to both users and TORQUE.  This has a number of effects on the cluster:

1) New jobs that are submitted during this state are recorded by TORQUE but not transmitted to Moab.  This prevents them from being scheduled and keeps users from being able to check their jobs' start times.  These jobs are in the database and will be scheduled once we have restarted Moab and returned the clients to their proper state.  Note: deleting and resubmitting jobs will not solve this issue.

2) Jobs that have finished running do not report their status to TORQUE, causing the status displayed by qstat to include a negative walltime.  This can cause confusion.  These jobs were properly handled by the cluster, but have not reported their final status back to the servers.  When Moab has been returned to service, these will resolve.

3) Tools requesting information from Moab will not return correct information. Tools with known issues in this regard include (not a complete list):
  • mdiag
  • showq
  • showstart
  • checkjob
qstat will respond, but its output will not be updated until Moab has been returned to proper functioning.

As a general statement, please be aware that this issue does not impact jobs that are already running on the system and will not cause jobs to be lost.  When the communication between the scheduler (Moab) and the resource manager (TORQUE) is broken, NEW jobs will not be scheduled, and non-active jobs (those listed as "Idle" but not "Eligible" by showq) will not enter the active state and become ready for scheduling.  Additionally, jobs submitted to the cluster during this time will not appear in the output of showq at all.  
We are working on this issue, and hope to resolve it as quickly as possible.  Please feel free to inform us via our contact form if you see any behaviors that may be related to this issue, and we will answer any concerns as best we can.
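
As a rough way to tell which side of the split you are seeing (a sketch, assuming the standard TORQUE and Moab client commands mentioned above; replace <jobid> with your own job ID):

# TORQUE view: submitted jobs should still appear here even while Moab is unresponsive
qstat -u $USER
# Moab view: may hang or return stale information until Moab has been restarted
checkjob <jobid>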


Steps we are taking to resolve this issue:
  • We have worked with Intel to identify a known bug in Lustre that is causing these issues and have put together a roadmap for installing the patches.
  • We are developing tests to rapidly recognize when these disconnections occur and address them.
  • We are working with the scheduler vendor to identify and implement fixes for the problems these incidents have uncovered. The vendor has already accepted several fixes identified at MSU for inclusion in the next major release of Moab and TORQUE.
  • We are installing a new, much larger and much faster scratch space this summer. It will be available to users sometime this fall, after it is fully configured and tested. It will run a much newer version of Lustre and comes with a much better vendor support agreement than the previous scratch space.
Posted at Jul 07, 2015 by Matthew Bryan Scholz | 0 comments
Last changed Jul 20, 2015 12:18 by Camille Alva Archer
Labels: msucoding, announcement

MSU Coding Group
Facilitator: Charles Ofria

Location: Mondays, 3-4pm, iCER/BEACON Seminar room, 1455A Biomedical and Physical Science (BPS)

The MSU Coding Group is aimed at people who are generally comfortable with programming and want to discuss software development techniques, new programming languages, and algorithm design, or to help members with coding issues (finding bugs, code reviews, etc.).  Each meeting will start with one or two informal talks on relevant topics (capped at one hour combined), followed by discussion and informal coding for those who want to stick around.

If you would like to continue receiving weekly e-mails about the meetings, please sign up for our new Google Group here:
Below is the schedule for the next several weeks. 
June 29: Introduction to Emscripten (C++ to Javascript Compiler), Charles Ofria
July 6:    Introduction to D3 (Javascript Graphing Library), Emily Dolson
               Introduction to C3 (Javascript Chart Library built on D3), Anya Johnson
July 13:  Introduction to OpenSCAD (3D modeling for programmers), Tim Schmidt
July 20:  The use of Python packages, including the iPython Notebook, Matplotlib for graphs, Pandas for data analysis, Numpy, Statsmodels, RPy2 for interaction with GNU R, and PyFANN for interaction with the Fast Artificial Neural Network library, Wesley R. Elsberry
July 27:  Introductions to Jupyter and Flask, Bill Punch
Posted at Jul 01, 2015 by Camille Alva Archer | 0 comments
Last changed Jun 24, 2015 16:54 by Greg Mason
Labels: maintenance

 

  • 4:07pm - The Lustre file system has been restored and the scheduler has been resumed.  Normal jobs are processing at this time. Jobs that were running against scratch likely experienced disk I/O errors and may have failed; please resubmit these jobs.

  • 12:46pm - The metadata servers that host the Lustre filesystem (a.k.a. the scratch space) are down at this time.  We are working with the vendor to restore service as soon as possible.  In the interim, we have paused the scheduler to keep queued jobs from failing.  

  • 10:08am – The HPCC is currently experiencing an issue with the Lustre filesystem client (a.k.a. the scratch space) that is causing the job scheduler to respond slowly or time out.  This is also causing issues with our configuration management systems, resulting in a temporary reduction in job capacity.  We are working to resolve this issue and return to normal operating capacity as soon as possible. 
Posted at Jun 24, 2015 by Anthony Parker | 0 comments

 
