
My role at MSU is to help researchers make the best use of computation in their research. I help at all levels of the process, including grant writing, teaching, profiling, debugging, and general scientific workflow optimization, so that programs can scale to run on larger systems. For example, I have taught a wide range of user seminars: parallel programming seminars on Makeflow, OpenMP, MPI, CUDA and the MATLAB Parallel Computing Toolbox; language seminars on C, C++, FORTRAN, Python, MATLAB and Bash scripting; and tools seminars on Git, TotalView, ParaView, CMake, and Berkeley Lab Checkpoint/Restart (BLCR).

In one-on-one consulting appointments I have had exposure to many other scientific tools, including Ansys tools, Mathematica, ENZO, YT, ImageJ, GAMES, NAMD, GROMACS, OpenMM, OpenFLOW, R, SAS, STATA and many others. I also lead projects to help expand the capabilities of our HPC system here at MSU. These projects include the development of a portable X11/SSH USB key that we give to new users, an HPCC daily testing tool called easybutton, and an entire suite of tools designed to make HPC easier called powertools, which includes scripts for HADOOP on Demand, BLCR on demand, render farms, Files as Semaphores and many others.

This blog is a place for me to share what I have learned working with researchers.


Most Recent Blog Post

Last changed Jan 16, 2015 11:39 by Dirk Joel-Luchini Colbry

Sometimes we get jobs that stall right at the beginning but do not error out until the walltime for the job has been exceeded. Users get an email saying their job "exceeds walltime", but when they check the output nothing (or very little) seems to have happened. The cause of this problem is highly dependent on what the job is doing. However, in some cases simply resubmitting the job gets it working. The following scripts check whether the program is making progress and automatically resubmit the job if there seems to be a problem.


file_flag_example.qsub

output_monitor_example.qsub

    $testfile ) &
    PID=$!
    # Sleep for enough time to start generating output
    sleep 300
    linecount1=`cat $testfile | wc -l`
    # Sleep enough for more output
    sleep 100
    linecount2=`cat $testfile | wc -l`
    if [ "$linecount1" == "$linecount2" ]
    then
        echo "Job Seems to have stalled. Killing and restarting"
        kill $PID
        qsub $0
        echo "Job stats for debugging"
        qstat -f ${PBS_JOBID}
        exit 1
    fi
    wait $PID
    RET=$?
    qstat -f ${PBS_JOBID}
    #return the output of the main program
    exit $RET

qstat_monitor_example.qsub
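The stall check in the output-monitor script compares the line count of the output file before and after a wait. As a minimal sketch, that comparison could be factored into a reusable function. The names `count_lines` and `output_is_growing` are hypothetical and do not come from the attached scripts, and the wait interval is passed in rather than fixed at the 300 and 100 seconds used above.

```shell
#!/bin/bash
# Hypothetical helpers, not part of the attached scripts.

# Print the number of lines in a file, or 0 if the file does not exist yet.
count_lines() {
    if [ -f "$1" ]; then
        echo $(( $(wc -l < "$1") ))
    else
        echo 0
    fi
}

# Succeed (return 0) if the file gained at least one line during the
# wait interval (in seconds); fail (return 1) if it did not grow.
output_is_growing() {
    local file=$1 interval=$2
    local before after
    before=$(count_lines "$file")
    sleep "$interval"
    after=$(count_lines "$file")
    [ "$after" -gt "$before" ]
}
```

In a job script this might be used as `output_is_growing "$testfile" 100 || { kill $PID; qsub $0; exit 1; }`, assuming `$testfile` and `$PID` are set as in the script above.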


These solutions are nice workarounds because, when they work, the scripts simply restart your job until it runs and the research gets done. However, using this hack does not get at the root of the problem. There are actually two problems:

  1. Something is broken, causing the job to hang. This could be a race condition in the code, a bad node, bad file I/O, bad network connections, etc. It all depends on what the code is doing.
  2. The code hangs instead of quitting and reporting an error. Well-engineered code should not hang. For example, file and network access should have timeouts so that the code does not run forever.
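When the step that hangs is an external command, one way to get a timeout without changing the program itself is the GNU coreutils `timeout` command, which exits with status 124 when the time limit is reached. The following is a rough sketch under that assumption; the wrapper name `run_with_timeout` is hypothetical.

```shell
#!/bin/bash
# Hypothetical wrapper: run a command under a time limit (in seconds)
# and report an error instead of hanging until the walltime runs out.
run_with_timeout() {
    local limit=$1
    shift
    timeout "$limit" "$@"
    local status=$?
    # GNU timeout exits with 124 when the command was killed for
    # exceeding the limit.
    if [ "$status" -eq 124 ]; then
        echo "ERROR: '$*' did not finish within ${limit}s" >&2
    fi
    return "$status"
}
```

For example, `run_with_timeout 60 cp big_input.dat "$TMPDIR"/` (file name made up for illustration) would fail with an error message after a minute rather than block the job indefinitely.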

Researchers should first notify the HPCC if they are using this hack so we can try to track down problems with the nodes. Researchers should also work to modify their code to report an error if something hangs. This will also help track down the problem.
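Because the workaround resubmits itself unconditionally with `qsub $0`, a job that stalls every time would keep resubmitting. One possible safeguard, sketched here as an assumption rather than something the attached scripts do, is to carry a retry counter between submissions using the `-v` option of `qsub`. The names `RETRY_COUNT` and `resubmit_args` are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: given the current retry count and a maximum,
# print the qsub arguments for the next submission, or fail once the
# maximum has been reached.
resubmit_args() {
    local count=${1:-0} max=${2:-3}
    if [ "$count" -ge "$max" ]; then
        echo "Giving up after $count resubmissions" >&2
        return 1
    fi
    echo "-v RETRY_COUNT=$((count + 1))"
}

# Usage inside a job script (shown as a comment, not executed here):
#   args=$(resubmit_args "${RETRY_COUNT:-0}" 3) && qsub $args "$0"
```

When the limit is reached, the message on stderr ends up in the job's error file, which gives the HPCC staff something concrete to look at.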

  • Dirk





Posted Jan 16, 2015 by Dirk Joel-Luchini Colbry

All Recent Blog Posts

Title Author Date Posted
Blog: Hack to automatically restart programs that stall during initialization Dirk Joel-Luchini Colbry Jan 16, 2015
Blog: 2014-12-16 HPCC workshop slides and handouts Dirk Joel-Luchini Colbry Dec 15, 2014
Blog: 2014-12-05 Western Michigan University, Introduction to iCER slides Dirk Joel-Luchini Colbry Dec 04, 2014
Blog: zsh job number autocomplete Dirk Joel-Luchini Colbry Nov 16, 2014
Blog: 2014-10-23 Advanced High Performance Computing Dirk Joel-Luchini Colbry Oct 23, 2014
Blog: 2014-10-02: Intro to HPCC class presentation for ChE/MSE 802 Dirk Joel-Luchini Colbry Oct 08, 2014
Blog: CSE 891 Section 1: Parallel Computing: Fundamentals and Applications Dirk Joel-Luchini Colbry Sep 10, 2014
Blog: 2014-2015 New Faculty Orientation Dirk Joel-Luchini Colbry Aug 25, 2014
Blog: 2014-08-20: EDAMAME Workshop at Kellogg Biological Center Dirk Joel-Luchini Colbry Aug 19, 2014
Blog: 2014-05-07: Workshop on Managing, Sharing and Moving Big Data Dirk Joel-Luchini Colbry May 07, 2014


