Child pages
  • 2015-09-21: RNA-Sequencing Tools
Skip to end of metadata
Go to start of metadata

2015-09-21: RNA-Sequencing Tools workshop video recordings are here for your reference: 

RNA Sequencing

Table of Contents:

RNA sequencing as a tool

Why are we using RNAseq at all? 

1) It's NOT cheaper than microarrays

2) There are fewer tools than microarrays

So why do we use it? 

A) You can get MORE information than from a microarray for a good reference.

B) You don't have to have a perfect reference.

C) You can use it to IMPROVE your reference.

D) You don't have a model organism.


We have a wide array of students this week, with organisms from Bacteria to poorly sequenced plants and animals.  This course will try to give you some basic tools and workflows to analyze multiple data types on HPCC style resources to get your analysis moving as quickly as possible.  You should leave here tomorrow with improved understanding of some of the following:

I) How to use iCER resources to do analysis

II) Some of the more fundamental considerations when setting up an RNA sequencing experiment

III) A set of tools/workflows to use under particular experimental designs

IV) An idea of where and how to find additional tools for future analysis

V) An understanding of what "reproducible research" means in relation to RNAseq, and how to begin the process.



How Does This Webpage Work?

We will have three types of text.  Material that is part of a discussion, which will be in plain text, like this: 


See what I did there?

We will have text that you can copy and paste into a file:

for my $i in (@temp){
  print "$i

str = raw_input("Enter your input: ");
print "Received input is : ", str

Expected output will be displayed in "info" boxes:

Recieved input is : something you typed


We will also have commands to copy and paste into a command line.  They will look like this:


ls -latrh



Easy, right?


So, what are we going to do?

RNASeq workflows


RNAseq is a multistep process, and you NEED to have an analysis plan before you start. 


The workflow should look something like this:



RNAseq Theory Vs. Practice

Sequencing Strategy:

Things you need to consider when calculating reads, read length, paired vs. single.

1) Quality of reference (and gene model)

2) Number of samples

3) Statistical relevance

4) Splice variation

5) Genome Size



The better the model, the less sequencing you have to do.

The more complex the question, the MORE sequencing you have to do. 

If you want to stand on the RNAseq alone, you will probably fail.


Model organisms:

Human, Mouse, Rat, Arabidopsis, C. Elegans, Drosophila, Bacteria*

Tools well defined.


Semi-Model organisms:

Sequenced, unfinished genomes, some gene models

Extra work required.


Everything else:

Good luck!



From C. Titus Brown:

Other papers:


Statistical Calculator

Calculating n
Example valuesSymbolDescriptionValue
0.05α (alpha)Type1 Error0.05 3.24Power_Error 99Samples_needed_per_group
0.9β (beta)Power0.9 0.48Effect Size
200µ (mu)Counts200 2.255Noise
1.5σ (sigma)coefficient of variation1.5
2Δ (delta)Effect Size (fold change)2
Calculating Fold Change
Example valuesSymbolDescriptionValue
0.05α (alpha)Type1 Error0.05 2.672997085Detectable Fold Change in Expression
0.9β (beta)Power0.9 3.24Validity
20µ (mu)Counts20 2.30Noise
1.5σ (sigma)coefficient of variation1.5 0.97log(d)2
50ηnumber of samples you have for the experiment50
Modifiable parameter
End result
The user can substitute any number in the green boxes for their own. Using the example values proided from the top table, one would need 99 samples per group. However, if the user changes the coefficient of variation to be 0.6, the number of sample per group drops to 16.

Let's Get to Work:

1) log on to (ssh -XY

Connecting to the HPCC

2) Type:

module load powertools

getexample RNAseq-model

cd RNAseq-model




You should see:

[mscholz@dev-intel07 ~]$ getexample RNAseq-model
`/opt/software/powertools/share/examples//RNAseq-model' -> `./RNAseq-model'
`/opt/software/powertools/share/examples//RNAseq-model/data' -> `./RNAseq-model/data'
`/opt/software/powertools/share/examples//RNAseq-model/Homo_sapiens' -> `./RNAseq-model/Homo_sapiens'
`/opt/software/powertools/share/examples//RNAseq-model/' -> `./RNAseq-model/'
`/opt/software/powertools/share/examples//RNAseq-model/transcriptome' -> `./RNAseq-model/transcriptome'
`/opt/software/powertools/share/examples//RNAseq-model/transcriptome/genes.1.bt2' -> `./RNAseq-model/transcriptome/genes.1.bt2'
`/opt/software/powertools/share/examples//RNAseq-model/transcriptome/genes.2.bt2' -> `./RNAseq-model/transcriptome/genes.2.bt2'
`/opt/software/powertools/share/examples//RNAseq-model/transcriptome/genes.3.bt2' -> `./RNAseq-model/transcriptome/genes.3.bt2'
`/opt/software/powertools/share/examples//RNAseq-model/transcriptome/genes.4.bt2' -> `./RNAseq-model/transcriptome/genes.4.bt2'
`/opt/software/powertools/share/examples//RNAseq-model/transcriptome/genes.fa' -> `./RNAseq-model/transcriptome/genes.fa'
`/opt/software/powertools/share/examples//RNAseq-model/transcriptome/genes.gff' -> `./RNAseq-model/transcriptome/genes.gff'
`/opt/software/powertools/share/examples//RNAseq-model/transcriptome/genes.rev.1.bt2' -> `./RNAseq-model/transcriptome/genes.rev.1.bt2'
`/opt/software/powertools/share/examples//RNAseq-model/transcriptome/genes.rev.2.bt2' -> `./RNAseq-model/transcriptome/genes.rev.2.bt2'
`/opt/software/powertools/share/examples//RNAseq-model/examples' -> `./RNAseq-model/examples'
`/opt/software/powertools/share/examples//RNAseq-model/examples/' -> `./RNAseq-model/examples/'
`/opt/software/powertools/share/examples//RNAseq-model/examples/' -> `./RNAseq-model/examples/'
`/opt/software/powertools/share/examples//RNAseq-model/examples/' -> `./RNAseq-model/examples/'
`/opt/software/powertools/share/examples//RNAseq-model/examples/' -> `./RNAseq-model/examples/'
`/opt/software/powertools/share/examples//RNAseq-model/examples/' -> `./RNAseq-model/examples/'
[mscholz@dev-intel07 ~]$ cd RNAseq-model/
[mscholz@dev-intel07 RNAseq-model]$ ls
data examples Homo_sapiens transcriptome
[mscholz@dev-intel07 RNAseq-model]$ ./
qsub: waiting for job 22491825.mgr-04.i to start


If you don't time to use the red sticky.


We are now embarking on our foray into the great and powerful HPCC. PAY NO ATTENTION TO THE MAN BEHIND THE CURTAIN!!!


We are logging in to special nodes devoted to the class.  In normal circumstances, you would be doing this work on one of our development nodes.  More on this later.


Once the login completes

Now, let's look at our data:

cd RNAseq-model

 ls -l


[mscholz@css-032 RNAseq-model]$ ls -l
total 10
lrwxrwxrwx 1 mscholz staff-np 52 Feb 23 19:39 data -> /mnt/research/common-data/Examples/RNASeq-Model/data
drwxr-xr-x 2 mscholz staff-np 7 Feb 23 19:40 examples
lrwxrwxrwx 1 mscholz staff-np 60 Feb 23 19:39 Homo_sapiens -> /mnt/research/common-data/Examples/RNASeq-Model/Homo_sapiens
lrwxrwxrwx 1 mscholz staff-np 56 Feb 23 19:39 -> /mnt/research/common-data/Examples/RNASeq-Model/
drwxr-xr-x 2 mscholz staff-np 10 Feb 23 19:40 transcriptome

cd data


You will see the data files:

ERR315325_1.fastq ERR315326_1.fastq ERR315382_1.fastq ERR315424_1.fastq
ERR315325_2.fastq ERR315326_2.fastq ERR315382_2.fastq ERR315424_2.fastq

What we have are 4 sets of RNA seq data from human for 2 conditions.  


A quick word on fastq:

Each read has 4 lines.  Name, Sequence, Note, and Quality 

@ERR315325.1 HISEQ:79:H0CC7ADXX:2:1101:1238:1994/1

So, now we get right to work, right?



Quality Assessment and Quality Control

I would recommend one of the following 2 protocols:

Trimmomatic -> FastQC

FastQC -> Trimmomatic -> FastQC

Here's an example of the outputs:


Here is how we do it:

(Before we start, we're going to make small subsets of the data so that we can finish in a reasonable time frame):


cd ~/RNAseq-model

mkdir QA

head -n 400000 data/ERR315325_1.fastq > QA/Salivary_Rep1_r1.fastq

head -n 400000 data/ERR315325_2.fastq > QA/Salivary_Rep1_r2.fastq


We will take 100000 read pairs from one of the 4 samples, and use them for our analysis.  

But what did we do?

why are we naming this Salivary_rep1_r?.fastq?


We will come back and try this again on other samples.


Now that we have data...  Time for some reproducible research:

1) We name things so that we can remember what we did.

2) When you run things, save the commands VERBATIM in a single file.  

3) Try to organize your data in a way that you can find them.

you might want to have the following directories for any given analysis:







cd ~/RNAseq-model

cd QA

module load FastQC/0.11.2

fastqc Salivary_Rep1_r1.fastq

fastqc Salivary_Rep1_r2.fastq


Let's look at the results:

firefox Salivary_Rep1_r1_fastqc.html

OR you can do the whole thing interactively:



Now let's make it better:

module load FaQCs/1.3 -p "Salivary_Rep1_r1.fastq Salivary_Rep1_r2.fastq" -q 10 -thread 4 -pre salivary_rep1 -d .

firefox salivary_rep1_qc_report.pdf


So report looks different.

BUT, there's still another problem: 

So, we still need trimmomatic:

module load trimmomatic

java -jar $TRIM/trimmomatic  PE -threads 4 salivary_rep1.1.trimmed.fastq salivary_rep1.2.trimmed.fastq salivary_rep1.trim.adapterRemoved.1.fastq salivary_rep1.trim.adapterRemoved.u1.fastq salivary_rep1.trim.adapterRemoved.2.fastq salivary_rep1.trim.adapterRemoved.u2.fastq ILLUMINACLIP:$TRIM/adapters/TruSeq3-PE-2.fa:2:30:10 MINLEN:60

fastqc salivary_rep1.trim.adapterRemoved.1.fastq

firefox salivary_rep1.trim.adapterRemoved.1_fastqc.html


Now, we're getting somewhere.  We have trimmed data.  We can align it.

Aligning to References


"Reference" is always wrong.

Alignment is always less than perfect


So, what do we do?


module load tophat2/2.0.12

tophat -p 4 \
    -G ~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf \
    --transcriptome-index=$HOME/RNAseq-model/transcriptome \
    -o tophat_salivary_rep1 \
    ~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome \
     salivary_rep1.trim.adapterRemoved.1.fastq  salivary_rep1.trim.adapterRemoved.2.fastq

So, what is this doing?




OK, now we should have an output directory:

ls tophat_salivary_rep1

samtools view -c -F 4 tophat_salivary_rep1/accepted_hits.bam


module load PySAM/0.6
module load HTSeq/0.6.1
htseq-count --format=bam --stranded=no --order=pos \
    tophat_salivary_rep1/accepted_hits.bam \
    ~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf > salivary_repl1_counts.txt

Now we have data!!!


What do we do with it?  




We can repeat all of the steps above, or we can make a script to do it for us.  We'll do "both"


cd ~/RNAseq-model


THis will open up a text editing window.  paste the following:


module load FaQCs/1.3
module load FastQC
module load trimmomatic
module load tophat2/2.0.12
module load HTSeq
cd ~/RNAseq-model
cd ~/RNAseq-model
mkdir QA
head -n 400000 data/ERR315326_1.fastq > QA/Lung_Rep1.r1.fastq
head -n 400000 data/ERR315326_2.fastq > QA/Lung_Rep1.r2.fastq
cd QA
fastqc Lung_Rep1.r1.fastq
fastqc Lung_Rep1.r2.fastq -p "Lung_Rep1.r1.fastq Lung_Rep1.r2.fastq" -q 10 -min_L 60 -d . -pre lung_rep1 
java -jar $TRIM/trimmomatic PE -threads 4 lung_rep1.1.trimmed.fastq lung_rep1.2.trimmed.fastq lung_rep1.trimmed.adapterRemoved.1.fastq lung_rep1.trimmed.adapterRemoved.u1.fastq lung_rep1.trimmed.adapterRemoved.2.fastq lung_rep1.trimmed.adapterRemoved.u2.fastq ILLUMINACLIP:$TRIM/adapters/TruSeq3-PE-2.fa:2:30:10
tophat -p 4 \
    -G ~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf \
    --transcriptome-index=$HOME/RNAseq-model/transcriptome \
    -o tophat_lung_rep1 \
    ~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome \
     lung_rep1.trimmed.adapterRemoved.1.fastq  lung_rep1.trimmed.adapterRemoved.2.fastq
htseq-count --format=bam --stranded=no --order=pos \
    tophat_lung_rep1/accepted_hits.bam \
    ~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf > lung_repl1_counts.txt


And... ctrl-X to close and save


and now:


And NOW, can we do it again for the other samples?



Analyze Results using EdgeR

What is EdgeR?

EdgeR is an R package built around analyzing first micro-array data, then updated for use with NGS/RNAseq packages.  It uses fancy statistical methods to perform analysis.  



1) Don't use edgeR without sufficient statistical power

2) Don't use edgeR with low coverage


What can statistical analysis show you?

Differential Gene expression.  But what does that MEAN?  



Gene NameCounts (Condition 1)Counts (Condition 2)


Given differential gene expression calculation, what can we say about these samples?


A) Gene A is upregulated in Condition1 ?

B) Gene B is downregulated?

C) Other?


(hint, how many reads ARE there in each sample?)

This leads to the need for sophisticated statistical packages to do analysis.


Things that Differential gene expression (DGE) analysis may take into consideration:

1) Relative counts of reads per sample

2) Std. deviation of read counts within sample

3) Total number of average reads per sample (more reads, higher confidence in actual expression level)

Finally, EdgeR gives us a different statistic than you may be used to (it GIVES P value, but ):  FDR




FDR= False Discovery rate.  This is a measure of the likelihood that the gene listed as significantly differentially regulated is false.  

Read about it here

Let's try to use it.

cd ~/RNAseq-model/

mkdir edgeR

cd edgeR

curl -O
curl -O
curl -O
curl -O
curl -O
cd ~/RNAseq-model/edgeR
module load R
Rscript lung_saliva.R



This is based on the 100K reads we took earlier (in other words, the input files here should be the same as the ones you made already.

Overnight we will try to generate a full list.  In the meantime, let's look at the outputs.

gv edgeR-MA-plot.pdf


Challenge, set up these same charts using FULL set of data. 


RNAseq for Semi-Model Organisms


Sequencing is only half the battle.  Model organisms have both finished genomes and good transcriptome annotations.  Other genomes are not so lucky.


Let's take a look at chickens.



module load powertools
getexample RNAseq-semimodel

This will generate a new directory, RNAseq-semimodel which holds the data for an experiment to differentiate between male and female blastocysts.  


The problem, of course, is that the Chicken genome is not well annotated.   We will do the same thing as before (subsampling)


once you have a new prompt, let's get to work:


ls -l ~/RNAseq-semimodel/data/


-rw-rw-r-- 1 mscholz common-data 66 Sep 21 20:54 samples.description.txt
-r--r--r-- 1 mscholz common-data 6781517200 Dec 9 2014 SRR534005_1.fastq.gz
-r--r--r-- 1 mscholz common-data 7023515467 Dec 9 2014 SRR534005_2.fastq.gz
-r--r--r-- 1 mscholz common-data 7285848617 Dec 9 2014 SRR534006_1.fastq.gz
-r--r--r-- 1 mscholz common-data 7542383700 Dec 9 2014 SRR534006_2.fastq.gz
-r--r--r-- 1 mscholz common-data 7219923066 Dec 9 2014 SRR536786_1.fastq.gz
-r--r--r-- 1 mscholz common-data 7467116873 Dec 9 2014 SRR536786_2.fastq.gz
-r--r--r-- 1 mscholz common-data 7694614208 Dec 9 2014 SRR536787_1.fastq.gz
-r--r--r-- 1 mscholz common-data 7944043814 Dec 9 2014 SRR536787_2.fastq.gz

(Note the samples.description.txt, but what's inside?):

SRR534005 Female
SRR534006 Female
SRR536786 Male
SRR536787 Male

OK, so now, let's get back to work:

cd RNAseq-semimodel
gunzip -c data/SRR534005_1.fastq.gz | head -400000 | gzip > female_repl1_R1.fq.gz
gunzip -c data/SRR534005_2.fastq.gz | head -400000 | gzip > female_repl1_R2.fq.gz
module load FastQC/0.11.2
fastqc female_repl1_R1.fq.gz
fastqc female_repl1_R2.fq.gz

firefox *.html

Let's decide how we feel about the replicates.  Then let's trim


module load Trimmomatic/0.32
java -jar $TRIM/trimmomatic PE female_repl1_R1.fq.gz female_repl1_R2.fq.gz\
     female_repl1_R1.qc.fq.gz s1_se female_repl1_R2.qc.fq.gz s2_se \
     ILLUMINACLIP:$TRIM/adapters/TruSeq3-PE.fa:2:40:15 \


We are trimming AND removing adapters with trimmomatic this time.   

ILLUMINACLIP: is removing adapters.



is trimming off all bases of Q2 from 3' and 5' end of the reads.  

now, let's write a script to do all of the samples:


gunzip -c ~/RNAseq-semimodel/data/SRR534006_1.fastq.gz | head -400000 | gzip > female_repl2_R1.fq.gz
gunzip -c ~/RNAseq-semimodel/data/SRR534006_2.fastq.gz | head -400000 | gzip > female_repl2_R2.fq.gz

gunzip -c ~/RNAseq-semimodel/data/SRR536786_1.fastq.gz | head -400000 | gzip > male_repl1_R1.fq.gz
gunzip -c ~/RNAseq-semimodel/data/SRR536786_2.fastq.gz | head -400000 | gzip > male_repl1_R2.fq.gz

gunzip -c ~/RNAseq-semimodel/data/SRR536787_1.fastq.gz | head -400000 | gzip > male_repl2_R1.fq.gz
gunzip -c ~/RNAseq-semimodel/data/SRR536787_2.fastq.gz | head -400000 | gzip > male_repl2_R2.fq.gz

java -jar $TRIM/trimmomatic PE female_repl2_R1.fq.gz female_repl2_R2.fq.gz\
     female_repl2_R1.qc.fq.gz s1_se female_repl2_R2.qc.fq.gz s2_se \
     ILLUMINACLIP:$TRIM/adapters/TruSeq3-PE.fa:2:40:15 \

java -jar $TRIM/trimmomatic PE male_repl1_R1.fq.gz male_repl1_R2.fq.gz\
     male_repl1_R1.qc.fq.gz s1_se male_repl1_R2.qc.fq.gz s2_se \
     ILLUMINACLIP:$TRIM/adapters/TruSeq3-PE.fa:2:40:15 \

java -jar $TRIM/trimmomatic PE male_repl2_R1.fq.gz male_repl2_R2.fq.gz\
     male_repl2_R1.qc.fq.gz s1_se male_repl2_R2.qc.fq.gz s2_se \
     ILLUMINACLIP:$TRIM/adapters/TruSeq3-PE.fa:2:40:15 \

ctrl-x to quit and save



Now, we have trimmed samples to do our FIRST alignment.  What are we doing?  

To build a NEW transcriptional model, we would want to use ALL available reads, in this case we'll use the subsets, but ALL of them:



module load TopHat2/2.0.12

tophat -p 4 \
    -G  $HOME/RNAseq-semimodel/reference/Gallus_gallus/UCSC/galGal3/Annotation/Genes/genes.gtf \
    --transcriptome-index=$HOME/RNAseq-semimodel/tophat/transcriptome \
    -o tophat_all \
    $HOME/RNAseq-semimodel/reference/Gallus_gallus/UCSC/galGal3/Sequence/Bowtie2Index/genome \
 female_repl1_R1.qc.fq.gz,male_repl1_R1.qc.fq.gz,female_repl2_R1.qc.fq.gz,male_repl2_R1.qc.fq.gz \



Again, Tophat will build a new transcriptome reference, then align all genes.  NOW comes the fun part. 


let's look at the statistics:


less tophat_all/align_summary.txt


Now that we have outputs, let's use cufflinks to make a new gtf file:


module load cufflinks/2.2.1
 cufflinks -o cuff_all tophat_all/accepted_hits.bam
ls -1 cuff_all/transcripts.gtf > cuff_list.txt
cuffmerge -g $HOME/RNAseq-semimodel/reference/Gallus_gallus/UCSC/galGal3/Annotation/Genes/genes.gtf \
    -o cuffmerge_all \
    -s $HOME/RNAseq-semimodel/reference/Gallus_gallus/UCSC/galGal3/Sequence/WholeGenomeFasta/genome.fa \

But of course, tools aren't perfect, so we have to remove some useless lines:


curl -O
 python cuffmerge_all/merged.gtf > cuffmerge_all/nostrand.gtf

W have now made a NEW GTF file (after editing it to remove stuff that's not compatible with Tophat (I know, I know)), and a new fasta file.  

So, what now?

Map reads to reference using NEW gene models

This should look familiar:

tophat -p 4 \
    -G cuffmerge_all/nostrand.gtf \
    --transcriptome-index=$HOME/RNAseq-semimodel/tophat/merged \
    -o tophat_female_repl1 \
    ~/RNAseq-semimodel/reference/Gallus_gallus/UCSC/galGal3/Sequence/Bowtie2Index/genome \
    female_repl1_R1.qc.fq.gz female_repl1_R2.qc.fq.gz


Note that we're using a new output directory (merged) for the transcriptome index, and that we are using the original reference genome.  Why?


And now to count the reads:

module load HTSeq/0.6.1
htseq-count --format=bam --stranded=no --order=pos \
    tophat_female_repl1/accepted_hits.bam \
    cuffmerge_all/nostrand.gtf > female_repl1_counts.txt


And now we have counted them, and they're ready for edgeR. 


Analyze sample 2


OK, let's open a new script:



and paste in:

module load TopHat2/2.0.12
module load PySAM/0.6
module load HTSeq/0.6.1

# go to the 'rnaseq' directory in my home directory

# now run Tophat!
tophat \
    -G cuffmerge_all/nostrand.gtf \
    --transcriptome-index=$HOME/RNAseq-semimodel/tophat/merged \
    -o tophat_female_repl2 \
    ~/RNAseq-semimodel/reference/Gallus_gallus/UCSC/galGal3/Sequence/Bowtie2Index/genome \
    female_repl2_R1.qc.fq.gz female_repl2_R2.qc.fq.gz

htseq-count --format=bam --stranded=no --order=pos \
    tophat_female_repl2/accepted_hits.bam \
    cuffmerge_all/nostrand.gtf > female_repl2_counts.txt


again, easy to paste, this will analyze A-Z up to the sequence counts:



OK, now let's edit that file to make and (using male_repl1 and male_repl2 as inputs and outputs)....


and run:

bash ; bash

this is using fancy bash to run BOTH programs sequentially.  Just be patient.


Once it's done, we can now run the edgeR.




Last bits:


  • No labels