Blog

2014-RNASeq for Semi-model organisms

Now you are all experts at analyzing RNA-seq data, right?

 

"But Matt, I have X samples, and I need to analyze all of them"

You don't want to run it all by hand, do you?

 

Qsub:

Here's the example script (don't type this):

#! /bin/bash -login
#PBS -l walltime=24:00:00,nodes=1:ppn=2,mem=10gb

#PBS -A classbuyin

#change directory to the location of this file

cd $PBS_O_WORKDIR

#run this command

bash chick-tophat2.sh

#print some statistics about your run

qstat -f ${PBS_JOBID}

So, this will just run bash chick-tophat2.sh
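One thing worth noticing before we dissect the header: the #PBS lines are ordinary bash comments. Bash ignores them; only the scheduler reads them. That means you can test the body of a job script locally before submitting it. A small sketch (demo-job.sh is a made-up example, not one of the class scripts):

```shell
# Write a tiny job script.  The #PBS line is a comment to bash,
# so running the script directly just executes the body.
cat > demo-job.sh <<'EOF'
#!/bin/bash
#PBS -l walltime=00:01:00,nodes=1:ppn=1,mem=1gb
echo "job body ran"
EOF
bash demo-job.sh
# prints: job body ran
```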

What the S*#$ is this stuff at the beginning?


HPCC schedules jobs.  

Why?

  1. There are more jobs than computers (demand outstrips supply)
  2. Lots of users
  3. Hard to keep things fair (sneaky people could use ALL the machines all the time)

So, everyone has to wait in line.  All the #PBS stuff is a series of directives that determine WHICH line you wait in, and what happens once your job starts...

 

Breaking it down:

  1. PBS -l (lowercase L) is the list of computational requirements:
    1. nodes: how many COMPUTERS this will run on (for almost ALL bioinformatics tools, this will be 1)
    2. ppn: how many cores you are using (we will talk about this, but PLEASE try to get this right)
    3. mem: how much memory you think your process will need (overestimate)
  2. PBS -m controls when the queue manager sends you an email:
    1. a = when the job is aborted (finishes with an error, or is killed)
    2. b = when the job begins
    3. e = when the job ends
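Putting the common options together, a job-script header might look like this. This is a sketch: the email address, job name, and resource numbers are placeholders you'd replace with your own (#PBS -M sets the address the -m notifications go to, and #PBS -N gives the job a readable name):

```shell
#!/bin/bash -login
#PBS -l walltime=04:00:00,nodes=1:ppn=4,mem=20gb   # resources: 4 hours, 1 node, 4 cores, 20 GB
#PBS -m abe                                        # email on abort, begin, and end
#PBS -M you@msu.edu                                # where to send the mail (placeholder)
#PBS -N my-tophat-run                              # job name shown in qstat (placeholder)
```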

There are many other options; you can take a look here if you start to get savvy:

HPCC Quick Reference Sheet (bottom of the page)

 

So, here's the thing:

 

It's not easy.  You will NEED to think like a computer to estimate your job requirements properly.  So: you will overestimate.

 

If you're not sure your estimate is right, you can add:

 

qstat -f ${PBS_JOBID}

 

at the end of the job script to see what you actually used.

 

That kind of output can be used to tune your -l requirements.  For example:

 

Job Id: 21259327.mgr-04.i
Job_Name = process-2.sh
Job_Owner = ctb@dev-intel14-phi.i
resources_used.cput = 05:14:12
resources_used.mem = 3719812kb
resources_used.vmem = 4509756kb
resources_used.walltime = 01:42:27
job_state = R
queue = main_large
server = mgr-04.i
Account_Name = ged-intel11
Checkpoint = u
ctime = Fri Dec 5 08:18:18 2014
Error_Path = dev-intel14-phi.i:/mnt/home/ctb/rnaseq/process2/process-2.sh.e21259327
exec_host = qml-002/32+qml-002/33+qml-002/34+qml-002/35
exec_port = 15003+15003+15003+15003
Hold_Types = n
Join_Path = n
Keep_Files = n
Mail_Points = a
Mail_Users = titus@idyll.org
mtime = Fri Dec 5 08:19:28 2014
Output_Path = dev-intel14-phi.i:/mnt/home/ctb/rnaseq/process2/process-2.sh.o21259327
Priority = 0
qtime = Fri Dec 5 08:19:19 2014
Rerunable = False
Resource_List.mem = 100gb
Resource_List.nodect = 1
Resource_List.nodes = 1:ppn=4
Resource_List.walltime = 24:00:00
session_id = 38650
etime = Fri Dec 5 08:18:18 2014
submit_args = process-2.sh
start_time = Fri Dec 5 08:19:28 2014
Walltime.Remaining = 80130
start_count = 1
fault_tolerant = False
job_radix = 0
submit_host = dev-intel14-phi.i

 

The important fields are:

Resource_List.mem = 100gb

You asked for 100 GB

resources_used.mem = 3719812kb
resources_used.walltime = 01:42:27

You actually used about 3.5 GB of RAM (3719812 kb), over roughly an hour and 42 minutes of walltime.

So next time you could request, say, ~10 GB rather than the 100 GB you asked for, and still have generous headroom.
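Since the scheduler reports memory in kb, a one-liner makes the conversion less error-prone. Here we feed it the saved line from above; on the cluster you'd pipe `qstat -f ${PBS_JOBID}` straight in instead of the echo:

```shell
# Pull out resources_used.mem and convert kb (1024-byte blocks) to GB
echo "resources_used.mem = 3719812kb" \
  | awk -F' = ' '/resources_used.mem/ {sub("kb","",$2); printf "%.1f GB\n", $2/1024/1024}'
# prints: 3.5 GB
```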

Discuss:

 

Now:

you should be able to do the following, and see success:

 

qsub process-1.sh

qsub process-2.sh

qsub process-3.sh

qsub process-4.sh

watch qstat -u <netid>
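The four submissions above can also be done in one loop. Written as a dry run here: the echo just prints the commands so you can check them; delete the echo to actually submit (this assumes process-1.sh through process-4.sh are in the current directory):

```shell
# Dry run: print the qsub commands for all four job scripts.
# Remove the `echo` to really submit them to the queue.
for n in 1 2 3 4; do
    echo qsub process-${n}.sh
done
```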

 

 

RNA-Seq Model Organism 2014

Quick documentation:

 

We are first going to re-grab the examples.

 

Unfortunately, they were set up wrong, so we have to delete the example first:

 

cd ~/

rm -rf RNAseq-model

module load powertools

getexample RNAseq-model

 

Now we've got some new data.  Let's take a look:

cd RNAseq-model/examples

ls

You should see a list like this:

[mscholz@dev-intel14-phi examples]$ ls -l
total 20
-rw-r--r-- 1 mscholz helpdesk 102 Dec 5 09:26 env.sh
-rw-r--r-- 1 mscholz helpdesk 1043 Dec 5 09:42 process-1.sh
-rw-r--r-- 1 mscholz helpdesk 1031 Dec 5 09:42 process-2.sh
-rw-r--r-- 1 mscholz helpdesk 1004 Dec 5 09:42 process-3.sh
-rw-r--r-- 1 mscholz helpdesk 1002 Dec 5 09:42 process-4.sh

Let's take a look:

more process-1.sh

It'll look like this:

#! /bin/bash
#PBS -l walltime=24:00:00,nodes=1:ppn=4,mem=10gb

#You COULD do this, but then you're stuck in that directory, or changing all your files
#cd $HOME/rnaseq/script
cd $PBS_O_WORKDIR

bash env.sh

java -jar $TRIM/trimmomatic PE ../data/ERR315325_1.fastq ../data/ERR315325_2.fastq \
salivary_repl1_R1.qc.fq.gz s1_se salivary_repl1_R2.qc.fq.gz s2_se \
ILLUMINACLIP:$TRIM/adapters/TruSeq3-PE.fa:2:40:15 \
LEADING:2 TRAILING:2 \
SLIDINGWINDOW:4:2 \
MINLEN:25


tophat -p 4 \
-G ~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf \
--transcriptome-index=$HOME/RNAseq-model/transcriptome \
-o tophat_salivary_repl1 \
~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome \
salivary_repl1_R1.qc.fq.gz salivary_repl1_R2.qc.fq.gz

htseq-count --format=bam --stranded=no --order=pos \
tophat_salivary_repl1/accepted_hits.bam \
~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf > salivary_repl1_counts.txt
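Once a job like this finishes, it's worth sanity-checking the counts file before moving on. A minimal sketch, using a fabricated demo_counts.txt standing in for the real salivary_repl1_counts.txt:

```shell
# htseq-count output is two tab-separated columns (gene id, read count),
# with __-prefixed summary lines (e.g. __no_feature) at the bottom.
printf 'GENE1\t10\nGENE2\t0\nGENE3\t5\n__no_feature\t100\n' > demo_counts.txt

# How many genes got at least one read?  (skip the __ summary lines)
awk '$1 !~ /^__/ && $2 > 0' demo_counts.txt | wc -l
# prints: 2
```

With real data, a suspiciously low count here (or a huge __no_feature total) is an early warning that something in the pipeline went wrong.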

Now:

you should be able to do the following, and see success:

 

qsub process-1.sh

qsub process-2.sh

qsub process-3.sh

qsub process-4.sh

watch qstat -u <netid>