Author: Jason Pell
For this tutorial, we will be using the short read de novo assembler ABySS to assemble a bacterial genome.
You will need a couple of software packages that are available on the HPCC: the ABySS assembler and the Python package Screed. Both are preinstalled and ready for your use. To begin, log in to gateway, and then to the development node dev-intel10. At the command prompt, enter:
You should now have your user environment configured for the tutorial.
Downloading Data Sets
First off, we’ll grab our datasets:
This data is an Illumina run of E. coli with 36bp paired-end reads and a 500bp insert size. A similar dataset with an insert size of 200bp is available under accession number SRR001665 from NCBI if you are interested in trying to assemble them together.
The assembly job will take a while to run, so it is recommended that you execute it as a cluster job. First, create a new subdirectory "k30" in your home directory:
Now, create a job submission script called "abyssTest.sh" using the HPCC editor "nano":
To obtain statistics on the file you just generated, go back to your home directory and run:
Other Stuff to Try
The optimal value for k depends greatly on the dataset. A lower value for k gives greater sensitivity, because shorter overlaps between reads are detected, but it can also produce more false overlaps. It is still the better option when you don’t have high coverage. A higher value for k tends to produce a more accurate assembly and longer contigs, but you are likely to miss many true read overlaps, so you need higher coverage to make up the difference.
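To make the trade-off concrete, here is a minimal Python sketch of how the choice of k changes the number of k-mers a single read contributes. It assumes the 36bp read length of this tutorial's dataset. The function names are made up for illustration and are not part of ABySS or Screed.

```python
def kmers_per_read(read_length, k):
    """Number of k-mers in one read: read_length - k + 1 (0 if the read is shorter than k)."""
    return max(read_length - k + 1, 0)


def kmers(sequence, k):
    """List every overlapping substring of length k in the sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


# Read length of the tutorial's E. coli dataset.
READ_LENGTH = 36

# Show how quickly the k-mer count per read shrinks as k grows.
for k in (20, 25, 30, 35):
    print("k =", k, "->", kmers_per_read(READ_LENGTH, k), "k-mers per read")
```

With 36bp reads, k=30 leaves only 7 k-mers per read, while k=20 leaves 17. Fewer k-mers per read means fewer chances for two reads to share one, which is one way to see why larger values of k need more coverage.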
Try varying the value for k by creating a new directory for each value that you would like to test. Then, run:
again to see how the assemblies compare to each other. If you generate a lot of assemblies, you can copy and paste the output to a text file and import it into Excel as a space-delimited file.
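As an alternative to a spreadsheet, the comparison can be sketched in a few lines of Python. The example below is a minimal sketch, not the statistics script used in this tutorial: it reads contig lengths from FASTA-formatted lines and reports the contig count, total length, longest contig, and N50. All function names are hypothetical, and the parser is plain Python so the block stays self-contained.

```python
def read_fasta_lengths(lines):
    """Yield the length of each sequence in an iterable of FASTA lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Length of the contig at which the running total, longest first, reaches half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0


def summarize(lengths):
    """Collect a few basic assembly statistics into a dictionary."""
    lengths = list(lengths)
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


# Tiny made-up example; for a real assembly, pass an open file handle instead.
example = [">c1", "ACGTACGT", ">c2", "ACGT", "AC", ">c3", "ACG"]
print(summarize(read_fasta_lengths(example)))
```

Running the same summary on the contigs file from each k directory gives one row per assembly, which makes it easier to see which value of k works best for your data.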
Finally, if you have your own dataset, you can try to assemble it on the HPCC.
See the next section for details on how to make your assemblies run faster with multiple cores.
Running ABySS in Parallel
This is where the real power of the HPCC comes into play: executing parallel jobs to decrease processing time. Some applications are written to use multiple processors, while others are not. One of the biggest advantages of ABySS is that most of its processes can be run in parallel. This means that you can split the work across several cores, or even several machines, using OpenMPI, which is installed on the HPCC. If you are interested in assembling a larger genome, you will likely need to take advantage of this parallelism. To try this out, let's create a new subdirectory:
Next, we'll create a new job submission script called "paraAbyss.sh":
Notice that we have increased the value of "ppn" on the second line to "3" - this represents the number of cores we are requesting for our job. Note also that on our ABySS command line we are using the option "np=2" - this represents the number of processes we are asking ABySS to run.
You will also notice that the number of cores requested with "ppn" is one more than the value of "np". This is because many applications treat the number of cores or threads you pass as being in addition to the main thread, so we will actually be using 3 cores (the 2 from np plus the main process), and must reserve that many with the job scheduler.
The value "nodes" is the number of machines or compute nodes requested. Here, we are asking that our job be run on one machine. You can get a lot more complicated by running jobs on multiple machines (i.e., increasing the value of "nodes") but a thorough discussion of that is beyond the scope of this tutorial.
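The relationship between "np", "ppn" and "nodes" can be written down as a small helper. This is only a sketch: it assumes the scheduler accepts Torque/PBS-style resource directives of the form `#PBS -l nodes=...:ppn=...`, and the function name is hypothetical. It follows this tutorial's rule of reserving one more core than the value passed as "np".

```python
def pbs_resource_line(np, nodes=1):
    """Build a Torque/PBS-style resource request using this tutorial's rule ppn = np + 1."""
    if np < 1 or nodes < 1:
        raise ValueError("np and nodes must both be at least 1")
    # Reserve one extra core for the main process, as described above.
    ppn = np + 1
    return f"#PBS -l nodes={nodes}:ppn={ppn}"


# np=2 on the ABySS command line means requesting 3 cores on one node.
print(pbs_resource_line(2))
```

If you raise "np" for a larger genome, recompute the resource line the same way, so that the job never uses more cores than it reserved.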
You can now submit the job script to the scheduler: