RNA sequencing as a tool
Why are we using RNAseq at all?
1) It's NOT cheaper than microarrays
2) There are fewer tools than microarrays
So why do we use it?
A) You can get MORE information than from a microarray for a good reference.
B) You don't have to have a perfect reference.
C) You can use it to IMPROVE your reference.
D) You don't have a model organism.
We have a wide array of students this week, with organisms from Bacteria to poorly sequenced plants and animals. This course will try to give you some basic tools and workflows to analyze multiple data types on HPCC style resources to get your analysis moving as quickly as possible. You should leave here tomorrow with improved understanding of some of the following:
I) How to use iCER resources to do analysis
II) Some of the more fundamental considerations when setting up an RNA sequencing experiment
III) A set of tools/workflows to use under particular experimental designs
IV) An idea of where and how to find additional tools for future analysis
V) An understanding of what "reproducible research" means in relation to RNAseq, and how to begin the process.
How Does This Webpage Work?
We will have three types of text. Material that is part of a discussion, which will be in plain text, like this:
See what I did there?
We will have text that you can copy and paste into a file:
Expected output will be displayed in "info" boxes:
We will also have commands to copy and paste into a command line. They will look like this:
So, what are we going to do?
RNAseq is a multistep process, and you NEED to have an analysis plan before you start.
The workflow should look something like this:
RNAseq Theory Vs. Practice
Things you need to consider when calculating reads, read length, paired vs. single.
1) Quality of reference (and gene model)
2) Number of samples
3) Statistical relevance
4) Splice variation
5) Genome Size
0) WHAT IS THE GOAL OF THE EXPERIMENT?
The better the model, the less sequencing you have to do.
The more complex the question, the MORE sequencing you have to do.
If you want to stand on the RNAseq alone, you will probably fail.
Human, Mouse, Rat, Arabidopsis, C. Elegans, Drosophila, Bacteria*
Tools well defined.
Sequenced, unfinished genomes, some gene models
Extra work required.
From C. Titus Brown:
|0.05||α (alpha)||Type1 Error||0.05||#NAME?||Power_Error||#NAME?||Samples_needed_per_group|
|0.9||β (beta)||Power||0.9||0.48||Effect Size|
|1.5||σ (sigma)||coefficient of variation||1.5|
|2||Δ (delta)||Effect Size (fold change)||2|
|Calculating Fold Change|
|0.05||α (alpha)||Type1 Error||0.05||#NAME?||Detectable Fold Change in Expression|
|1.5||σ (sigma)||coefficient of variation||1.5||#NAME?||log(d)2|
|50||η||number of samples you have for the experiment||50|
|The user can substitute any number in the green boxes for their own. Using the example values proided from the top table, one would need 99 samples per group. However, if the user changes the coefficient of variation to be 0.6, the number of sample per group drops to 16.|
Let's Get to Work:
1) log on to hpc.msu.edu (ssh -XY email@example.com)
module load powertools
You should see:
If you don't time to use the red sticky.
We are now embarking on our foray into the great and powerful HPCC. PAY NO ATTENTION TO THE MAN BEHIND THE CURTAIN!!!
We are logging in to special nodes devoted to the class. In normal circumstances, you would be doing this work on one of our development nodes. More on this later.
Once the login completes
Now, let's look at our data:
You will see the data files:
What we have are 4 sets of RNA seq data from human for 2 conditions.
A quick word on fastq:
Each read has 4 lines. Name, Sequence, Note, and Quality
So, now we get right to work, right?
Quality Assessment and Quality Control
I would recommend one of the following 2 protocols:
Trimmomatic -> FastQC
FastQC -> Trimmomatic -> FastQC
Here's an example of the outputs:
Here is how we do it:
(Before we start, we're going to make small subsets of the data so that we can finish in a reasonable time frame):
head -n 400000 data/ERR315325_1.fastq > QA/Salivary_Rep1_r1.fastq
head -n 400000 data/ERR315325_1.fastq > QA/Salivary_Rep1_r2.fastq
We will take 100000 read pairs from one of the 4 samples, and use them for our analysis.
But what did we do?
why are we naming this Salivary_rep1_r?.fastq?
We will come back and try this again on other samples.
Now that we have data... Time for some reproducible research:
1) We name things so that we can remember what we did.
2) When you run things, save the commands VERBATIM in a single file.
3) Try to organize your data in a way that you can find them.
you might want to have the following directories for any given analysis:
module load FastQC/0.11.2
Let's look at the results:
OR you can do the whole thing interactively:
Now let's make it better:
module load FaQCs/1.3
FaQCs.pl -p "Salivary_Rep1_r1.fastq Salivary_Rep1_r2.fastq" -q 10 -thread 4 -pre salivary_rep1 -d .
So report looks different.
BUT, there's still another problem:
So, we still need trimmomatic:
module load trimmomatic
java -jar $TRIM/trimmomatic PE -threads 4 salivary_rep1.1.trimmed.fastq salivary_rep1.2.trimmed.fastq salivary_rep1.trim.adapterRemoved.1.fastq salivary_rep1.trim.adapterRemoved.u1.fastq salivary_rep1.trim.adapterRemoved.2.fastq salivary_rep1.trim.adapterRemoved.u2.fastq ILLUMINACLIP:$TRIM/adapters/TruSeq3-PE-2.fa:2:30:10 MINLEN:60
Now, we're getting somewhere. We have trimmed data. We can align it.
Aligning to References
"Reference" is always wrong.
Alignment is always less than perfect
So, what do we do?
module load tophat2/2.0.12tophat -p 4 \ -G ~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf \ --transcriptome-index=$HOME/RNAseq-model/transcriptome \ -o tophat_salivary_rep1 \ ~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome \ salivary_rep1.trim.adapterRemoved.1.fastq salivary_rep1.trim.adapterRemoved.2.fastq
So, what is this doing?
OK, now we should have an output directory:
ls tophat_salivary_rep1samtools view -c -F 4 tophat_salivary_rep1/accepted_hits.bam
module load PySAM/0.6 module load HTSeq/0.6.1htseq-count --format=bam --stranded=no --order=pos \ tophat_salivary_rep1/accepted_hits.bam \ ~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf > salivary_repl1_counts.txt
Now we have data!!!
What do we do with it?
WE MAKE MORE DATA:
We can repeat all of the steps above, or we can make a script to do it for us. We'll do "both"
THis will open up a text editing window. paste the following:
And... ctrl-X to close and save
And NOW, can we do it again for the other samples?