Child pages
  • 2015-02-24: RNA-Sequencing Tools
Skip to end of metadata
Go to start of metadata

RNA Sequencing

Table of Contents:

RNA sequencing as a tool

Why are we using RNAseq at all? 

1) It's NOT cheaper than microarrays

2) There are fewer tools than microarrays

So why do we use it? 

A) You can get MORE information than from a microarray for a good reference.

B) You don't have to have a perfect reference.

C) You can use it to IMPROVE your reference.

D) You don't have a model organism.


We have a wide array of students this week, with organisms from Bacteria to poorly sequenced plants and animals.  This course will try to give you some basic tools and workflows to analyze multiple data types on HPCC style resources to get your analysis moving as quickly as possible.  You should leave here tomorrow with improved understanding of some of the following:

I) How to use iCER resources to do analysis

II) Some of the more fundamental considerations when setting up an RNA sequencing experiment

III) A set of tools/workflows to use under particular experimental designs

IV) An idea of where and how to find additional tools for future analysis

V) An understanding of what "reproducible research" means in relation to RNAseq, and how to begin the process.



How Does This Webpage Work?

We will have three types of text.  Material that is part of a discussion, which will be in plain text, like this: 


See what I did there?

We will have text that you can copy and paste into a file:

for my $i in (@temp){
  print "$i

str = raw_input("Enter your input: ");
print "Received input is : ", str

Expected output will be displayed in "info" boxes:

Recieved input is : something you typed


We will also have commands to copy and paste into a command line.  They will look like this:


ls -latrh



Easy, right?


So, what are we going to do?

RNASeq workflows


RNAseq is a multistep process, and you NEED to have an analysis plan before you start. 


The workflow should look something like this:



RNAseq Theory Vs. Practice

Sequencing Strategy:

Things you need to consider when calculating reads, read length, paired vs. single.

1) Quality of reference (and gene model)

2) Number of samples

3) Statistical relevance

4) Splice variation

5) Genome Size



The better the model, the less sequencing you have to do.

The more complex the question, the MORE sequencing you have to do. 

If you want to stand on the RNAseq alone, you will probably fail.


Model organisms:

Human, Mouse, Rat, Arabidopsis, C. Elegans, Drosophila, Bacteria*

Tools well defined.


Semi-Model organisms:

Sequenced, unfinished genomes, some gene models

Extra work required.


Everything else:

Good luck!



From C. Titus Brown:


Other papers:



Statistical Calculator

Calculating n
Example valuesSymbolDescriptionValue
0.05α (alpha)Type1 Error0.053.24Power_Error99Samples_needed_per_group
0.9β (beta)Power0.90.48Effect Size
200µ (mu)Counts2002.255Noise
1.5σ (sigma)coefficient of variation1.5
2Δ (delta)Effect Size (fold change)2
Calculating Fold Change
Example valuesSymbolDescriptionValue
0.05α (alpha)Type1 Error0.052.672997085Detectable Fold Change in Expression
0.9β (beta)Power0.93.24Validity
20µ (mu)Counts202.30Noise
1.5σ (sigma)coefficient of variation1.50.97log(d)2
50ηnumber of samples you have for the experiment50
Modifiable parameter
End result
The user can substitute any number in the green boxes for their own. Using the example values proided from the top table, one would need 99 samples per group. However, if the user changes the coefficient of variation to be 0.6, the number of sample per group drops to 16.

Let's Get to Work:

1) log on to hpc.msu.edu (ssh -XY username@hpc.msu.edu)

Connecting to the HPCC

2) Type:

module load powertools

getexample RNAseq-model

cd RNAseq-model




You should see:

[mscholz@dev-intel07 ~]$ getexample RNAseq-model
`/opt/software/powertools/share/examples//RNAseq-model' -> `./RNAseq-model'
`/opt/software/powertools/share/examples//RNAseq-model/data' -> `./RNAseq-model/data'
`/opt/software/powertools/share/examples//RNAseq-model/Homo_sapiens' -> `./RNAseq-model/Homo_sapiens'
`/opt/software/powertools/share/examples//RNAseq-model/login.sh' -> `./RNAseq-model/login.sh'
`/opt/software/powertools/share/examples//RNAseq-model/transcriptome' -> `./RNAseq-model/transcriptome'
`/opt/software/powertools/share/examples//RNAseq-model/transcriptome/genes.1.bt2' -> `./RNAseq-model/transcriptome/genes.1.bt2'
`/opt/software/powertools/share/examples//RNAseq-model/transcriptome/genes.2.bt2' -> `./RNAseq-model/transcriptome/genes.2.bt2'
`/opt/software/powertools/share/examples//RNAseq-model/transcriptome/genes.3.bt2' -> `./RNAseq-model/transcriptome/genes.3.bt2'
`/opt/software/powertools/share/examples//RNAseq-model/transcriptome/genes.4.bt2' -> `./RNAseq-model/transcriptome/genes.4.bt2'
`/opt/software/powertools/share/examples//RNAseq-model/transcriptome/genes.fa' -> `./RNAseq-model/transcriptome/genes.fa'
`/opt/software/powertools/share/examples//RNAseq-model/transcriptome/genes.gff' -> `./RNAseq-model/transcriptome/genes.gff'
`/opt/software/powertools/share/examples//RNAseq-model/transcriptome/genes.rev.1.bt2' -> `./RNAseq-model/transcriptome/genes.rev.1.bt2'
`/opt/software/powertools/share/examples//RNAseq-model/transcriptome/genes.rev.2.bt2' -> `./RNAseq-model/transcriptome/genes.rev.2.bt2'
`/opt/software/powertools/share/examples//RNAseq-model/examples' -> `./RNAseq-model/examples'
`/opt/software/powertools/share/examples//RNAseq-model/examples/env.sh' -> `./RNAseq-model/examples/env.sh'
`/opt/software/powertools/share/examples//RNAseq-model/examples/process-2.sh' -> `./RNAseq-model/examples/process-2.sh'
`/opt/software/powertools/share/examples//RNAseq-model/examples/process-3.sh' -> `./RNAseq-model/examples/process-3.sh'
`/opt/software/powertools/share/examples//RNAseq-model/examples/process-4.sh' -> `./RNAseq-model/examples/process-4.sh'
`/opt/software/powertools/share/examples//RNAseq-model/examples/process-1.sh' -> `./RNAseq-model/examples/process-1.sh'
[mscholz@dev-intel07 ~]$ cd RNAseq-model/
[mscholz@dev-intel07 RNAseq-model]$ ls
data examples Homo_sapiens login.sh transcriptome
[mscholz@dev-intel07 RNAseq-model]$ ./login.sh
qsub: waiting for job 22491825.mgr-04.i to start


If you don't time to use the red sticky.


We are now embarking on our foray into the great and powerful HPCC. PAY NO ATTENTION TO THE MAN BEHIND THE CURTAIN!!!


We are logging in to special nodes devoted to the class.  In normal circumstances, you would be doing this work on one of our development nodes.  More on this later.


Once the login completes

Now, let's look at our data:

cd RNAseq-model

 ls -l


[mscholz@css-032 RNAseq-model]$ ls -l
total 10
lrwxrwxrwx 1 mscholz staff-np 52 Feb 23 19:39 data -> /mnt/research/common-data/Examples/RNASeq-Model/data
drwxr-xr-x 2 mscholz staff-np 7 Feb 23 19:40 examples
lrwxrwxrwx 1 mscholz staff-np 60 Feb 23 19:39 Homo_sapiens -> /mnt/research/common-data/Examples/RNASeq-Model/Homo_sapiens
lrwxrwxrwx 1 mscholz staff-np 56 Feb 23 19:39 login.sh -> /mnt/research/common-data/Examples/RNASeq-Model/login.sh
drwxr-xr-x 2 mscholz staff-np 10 Feb 23 19:40 transcriptome

cd data


You will see the data files:

ERR315325_1.fastq ERR315326_1.fastq ERR315382_1.fastq ERR315424_1.fastq
ERR315325_2.fastq ERR315326_2.fastq ERR315382_2.fastq ERR315424_2.fastq

What we have are 4 sets of RNA seq data from human for 2 conditions.  


A quick word on fastq:

Each read has 4 lines.  Name, Sequence, Note, and Quality 

@ERR315325.1 HISEQ:79:H0CC7ADXX:2:1101:1238:1994/1

So, now we get right to work, right?



Quality Assessment and Quality Control

I would recommend one of the following 2 protocols:

Trimmomatic -> FastQC

FastQC -> Trimmomatic -> FastQC

Here's an example of the outputs:


Here is how we do it:

(Before we start, we're going to make small subsets of the data so that we can finish in a reasonable time frame):


cd ~/RNAseq-model

mkdir QA

head -n 400000 data/ERR315325_1.fastq > QA/Salivary_Rep1_r1.fastq

head -n 400000 data/ERR315325_1.fastq > QA/Salivary_Rep1_r2.fastq


We will take 100000 read pairs from one of the 4 samples, and use them for our analysis.  

But what did we do?

why are we naming this Salivary_rep1_r?.fastq?


We will come back and try this again on other samples.


Now that we have data...  Time for some reproducible research:

1) We name things so that we can remember what we did.

2) When you run things, save the commands VERBATIM in a single file.  

3) Try to organize your data in a way that you can find them.

you might want to have the following directories for any given analysis:







cd ~/RNAseq-model

cd QA

module load FastQC/0.11.2

fastqc Salivary_Rep1_r1.fastq

fastqc Salivary_Rep1_r2.fastq


Let's look at the results:

firefox Salivary_Rep1_r1_fastqc.html

OR you can do the whole thing interactively:



Now let's make it better:

module load FaQCs/1.3

FaQCs.pl -p "Salivary_Rep1_r1.fastq Salivary_Rep1_r2.fastq" -q 10 -thread 4 -pre salivary_rep1 -d .

evince salivary_rep1_qc.pdf


So report looks different.

BUT, there's still another problem: 

So, we still need trimmomatic:

module load trimmomatic

java -jar $TRIM/trimmomatic  PE -threads 4 salivary_rep1.1.trimmed.fastq salivary_rep1.2.trimmed.fastq salivary_rep1.trim.adapterRemoved.1.fastq salivary_rep1.trim.adapterRemoved.u1.fastq salivary_rep1.trim.adapterRemoved.2.fastq salivary_rep1.trim.adapterRemoved.u2.fastq ILLUMINACLIP:$TRIM/adapters/TruSeq3-PE-2.fa:2:30:10 MINLEN:60



Now, we're getting somewhere.  We have trimmed data.  We can align it.

Aligning to References


"Reference" is always wrong.

Alignment is always less than perfect


So, what do we do?


module load tophat2/2.0.12

tophat -p 4 \
    -G ~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf \
    --transcriptome-index=$HOME/RNAseq-model/transcriptome \
    -o tophat_salivary_rep1 \
    ~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome \
     salivary_rep1.trim.adapterRemoved.1.fastq  salivary_rep1.trim.adapterRemoved.2.fastq

So, what is this doing?




OK, now we should have an output directory:

ls tophat_salivary_rep1

samtools view -c -F 4 tophat_salivary_rep1/accepted_hits.bam


module load PySAM/0.6
module load HTSeq/0.6.1
htseq-count --format=bam --stranded=no --order=pos \
    tophat_salivary_rep1/accepted_hits.bam \
    ~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf > salivary_repl1_counts.txt

Now we have data!!!


What do we do with it?  




We can repeat all of the steps above, or we can make a script to do it for us.  We'll do "both"


cd ~/RNAseq-model

nano sample2.sh

THis will open up a text editing window.  paste the following:


module load FaQCs/1.3
module load FastQC
module load trimmomatic
module load tophat2/2.0.12
module load HTSeq
cd ~/RNAseq-model
cd ~/RNAseq-model
mkdir QA
head -n 400000 data/ERR315326_1.fastq > QA/Lung_Rep1.r1.fastq
head -n 400000 data/ERR315326_2.fastq > QA/Lung_Rep1.r2.fastq
cd QA
fastqc Lung_Rep1.r1.fastq
fastqc Lung_Rep1.r2.fastq
FaQCs.pl -p "Lung_Rep1.r1.fastq Lung_Rep1.r2.fastq" -q 10 -min_L 60 -d . -pre lung_rep1 
java -jar $TRIM/trimmomatic PE -threads 4 lung_rep1.1.trimmed.fastq lung_rep1.2.trimmed.fastq lung_rep1.trimmed.adapterRemoved.1.fastq lung_rep1.trimmed.adapterRemoved.u1.fastq lung_rep1.trimmed.adapterRemoved.2.fastq lung_rep1.trimmed.adapterRemoved.u2.fastq ILLUMINACLIP:$TRIM/adapters/TruSeq3-PE-2.fa:2:30:10
tophat -p 4 \
    -G ~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf \
    --transcriptome-index=$HOME/RNAseq-model/transcriptome \
    -o tophat_lung_rep1 \
    ~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome \
     lung_rep1.trimmed.adapterRemoved.1.fastq  lung_rep1.trimmed.adapterRemoved.2.fastq
htseq-count --format=bam --stranded=no --order=pos \
    tophat_salivary_rep1/accepted_hits.bam \
    ~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf > lung_repl1_counts.txt


And... ctrl-X to close and save


and now:

bash sample2.sh

And NOW, can we do it again for the other samples?



  • No labels