My role at MSU is to help researchers make the best use of computation in their research. I help at all levels of the process, including grant writing, teaching, profiling, debugging, and general scientific workflow optimization, so that programs can scale to run on larger systems. For example, I have taught a wide range of user seminars: parallel programming seminars on Makeflow, OpenMP, MPI, CUDA and the MATLAB Parallel Computing Toolbox; language seminars on C, C++, FORTRAN, Python, MATLAB and Bash scripting; and tools seminars on Git, TotalView, ParaView, CMake, and Berkeley Lab Checkpoint/Restart (BLCR).
In one-on-one consulting appointments I have had exposure to many other scientific tools including Ansys tools, Mathematica, ENZO, YT, ImageJ, GAMES, NAMD, GROMACS, OpenMM, OpenFLOW, R, SAS, STATA and many others. I also lead projects to help expand the capabilities of our HPC system here at MSU. These projects include the development of a portable X11/SSH USB key that we give to new users, an HPCC daily testing tool called easybutton, and an entire suite of tools designed to make HPC easier called powertools, which includes scripts for HADOOP on Demand, BLCR on demand, render farms, Files as Semaphores and many others.
This blog is a place for me to share what I have learned working with researchers.
Most recent blog post (click on the title below to see more)
Sometimes we get jobs that stall right at the beginning but do not error out until the walltime for the job has been exceeded. Users get an email saying their job "exceeds walltime", but when they check the output, nothing (or very little) seems to have happened. The cause of this problem is highly dependent on what the job is doing. However, in some cases simply resubmitting the job gets it working. The following scripts check whether the program is running and automatically resubmit the job if there seems to be a problem.
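As a rough illustration of the idea (a sketch, not the HPCC scripts themselves), the Python below starts a program, waits a short time for it to show signs of life, and kills and restarts it if nothing happens. It assumes that a healthy run writes something to its output file soon after starting. The names `stalled_at_startup` and `run_with_restart`, the timeout, and the retry limit are all hypothetical choices for this example.

```python
import os
import subprocess
import sys
import time


def stalled_at_startup(proc, output_path, startup_timeout, poll_interval=0.1):
    """Return True if proc is still running but wrote nothing before the timeout.

    Assumption: a healthy program writes to output_path soon after it starts.
    """
    deadline = time.monotonic() + startup_timeout
    while time.monotonic() < deadline:
        if os.path.getsize(output_path) > 0:
            return False  # the program produced output, so it is not stalled
        if proc.poll() is not None:
            return False  # the program already exited, so it is not hanging
        time.sleep(poll_interval)
    return True


def run_with_restart(cmd, output_path, startup_timeout=60.0, max_attempts=3):
    """Run cmd, restarting it if it appears to hang during startup.

    Returns (exit_code, attempts_used). Raises RuntimeError if every attempt
    stalls, so the problem is reported instead of silently burning walltime.
    """
    for attempt in range(1, max_attempts + 1):
        # Opening with "wb" truncates the file, so each attempt starts empty.
        with open(output_path, "wb") as out:
            proc = subprocess.Popen(cmd, stdout=out, stderr=subprocess.STDOUT)
            if stalled_at_startup(proc, output_path, startup_timeout):
                print(f"attempt {attempt} stalled; restarting", file=sys.stderr)
                proc.kill()
                proc.wait()
                continue
            return proc.wait(), attempt
    raise RuntimeError(f"{cmd!r} stalled on all {max_attempts} attempts")
```

A call such as `run_with_restart(["./my_program"], "run.log", startup_timeout=120)` would go inside the job script, where `./my_program` and `run.log` are placeholders. This version restarts the program inside the same job; resubmitting the whole job through the scheduler instead is a variation that is not shown here.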
These solutions are nice workarounds because, when they work, the script just restarts your job until it runs and the research gets done. However, using this hack does not get at the root of the problem. There are actually two problems:
- Something is broken, causing the job to hang. This could be a race condition in the code, a bad node, bad file I/O, bad network connections, etc. It all depends on what the code is doing.
- The code hangs instead of quitting and reporting an error. Well-engineered code should not hang. For example, file and network access should have timeouts so that the code does not run forever.
Researchers should first notify the HPCC if they are using this hack so we can try to track down problems with the nodes. Researchers should also work to modify their code to report an error if something hangs. This will also help track down the problem.
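To make the second point concrete, here is a minimal sketch of a network read that reports an error instead of hanging. It uses Python's standard `socket` module; the function name `fetch_bytes`, the default timeout, and the error message are assumptions made up for this example, and real code would pick limits that suit its own I/O.

```python
import socket


def fetch_bytes(host, port, nbytes, timeout=10.0):
    """Read up to nbytes from host:port, failing loudly instead of hanging.

    The timeout passed to create_connection() is set on the socket itself,
    so it bounds both the connection attempt and the later recv() call.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            return sock.recv(nbytes)
    except socket.timeout as exc:
        # Turn a silent hang into an error that names what was being waited on.
        raise RuntimeError(
            f"no response from {host}:{port} within {timeout} s"
        ) from exc
```

With a check like this, a job that cannot reach its data source fails within seconds with a message naming the host, rather than sitting idle until the walltime runs out.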
|Blog: Hack to automatically restart programs that stall during initialization||Dirk Joel-Luchini Colbry||Jan 16, 2015|
|Blog: 2014-12-16 HPCC workshop slides and handouts||Dirk Joel-Luchini Colbry||Dec 15, 2014|
|Blog: 2014-12-05 Western Michigan University, Introduction to iCER slides||Dirk Joel-Luchini Colbry||Dec 04, 2014|
|Blog: zsh job number autocomplete||Dirk Joel-Luchini Colbry||Nov 16, 2014|
|Blog: 2014-10-23 Advanced High Performance Computing||Dirk Joel-Luchini Colbry||Oct 23, 2014|
|Blog: 2014-10-02: Intro to HPCC class presentation for ChE/MSE 802||Dirk Joel-Luchini Colbry||Oct 08, 2014|
|Blog: CSE 891 Section 1: Parallel Computing: Fundamentals and Applications||Dirk Joel-Luchini Colbry||Sep 10, 2014|
|Blog: 2014-2015 New Faculty Orientation||Dirk Joel-Luchini Colbry||Aug 25, 2014|
|Blog: 2014-08-20: EDAMAME Workshop at Kellogg Biological Center||Dirk Joel-Luchini Colbry||Aug 19, 2014|
|Blog: 2014-05-07: Workshop on Managing, Sharing and Moving Big Data||Dirk Joel-Luchini Colbry||May 07, 2014|