2015-09-21: RNA-Sequencing Tools workshop video recordings are here for your reference: https://www.youtube.com/playlist?list=PL010YiUqY7JSvmEPsQmKLuSQyDy-SYuLO
Table of Contents:
RNA sequencing as a tool
Why are we using RNAseq at all?
1) It's NOT cheaper than microarrays
2) There are fewer tools than microarrays
So why do we use it?
A) You can get MORE information than from a microarray for a good reference.
B) You don't have to have a perfect reference.
C) You can use it to IMPROVE your reference.
D) You don't have a model organism.
We have a wide array of students this week, with organisms from Bacteria to poorly sequenced plants and animals. This course will try to give you some basic tools and workflows to analyze multiple data types on HPCC style resources to get your analysis moving as quickly as possible. You should leave here tomorrow with improved understanding of some of the following:
I) How to use iCER resources to do analysis
II) Some of the more fundamental considerations when setting up an RNA sequencing experiment
III) A set of tools/workflows to use under particular experimental designs
IV) An idea of where and how to find additional tools for future analysis
V) An understanding of what "reproducible research" means in relation to RNAseq, and how to begin the process.
How Does This Webpage Work?
We will have three types of text. Material that is part of a discussion, which will be in plain text, like this:
See what I did there?
We will have text that you can copy and paste into a file:
Expected output will be displayed in "info" boxes:
We will also have commands to copy and paste into a command line. They will look like this:
So, what are we going to do?
RNAseq is a multistep process, and you NEED to have an analysis plan before you start.
The workflow should look something like this:
RNAseq Theory Vs. Practice
Things you need to consider when calculating reads, read length, paired vs. single.
1) Quality of reference (and gene model)
2) Number of samples
3) Statistical relevance
4) Splice variation
5) Genome Size
0) WHAT IS THE GOAL OF THE EXPERIMENT?
The better the model, the less sequencing you have to do.
The more complex the question, the MORE sequencing you have to do.
If you want to stand on the RNAseq alone, you will probably fail.
Human, Mouse, Rat, Arabidopsis, C. Elegans, Drosophila, Bacteria*
Tools well defined.
Sequenced, unfinished genomes, some gene models
Extra work required.
From C. Titus Brown:
|0.05||α (alpha)||Type1 Error||0.05||#NAME?||Power_Error||#NAME?||Samples_needed_per_group|
|0.9||β (beta)||Power||0.9||0.48||Effect Size|
|1.5||σ (sigma)||coefficient of variation||1.5|
|2||Δ (delta)||Effect Size (fold change)||2|
|Calculating Fold Change|
|0.05||α (alpha)||Type1 Error||0.05||#NAME?||Detectable Fold Change in Expression|
|1.5||σ (sigma)||coefficient of variation||1.5||#NAME?||log(d)2|
|50||η||number of samples you have for the experiment||50|
|The user can substitute any number in the green boxes for their own. Using the example values proided from the top table, one would need 99 samples per group. However, if the user changes the coefficient of variation to be 0.6, the number of sample per group drops to 16.|
Let's Get to Work:
1) log on to hpc.msu.edu (ssh -XY email@example.com)
Please note: When you log out and log back in to HPCC, run ./login.sh to enter the working node. YOU CANNOT RUN ANY JOBS FROM GATEWAY NODE
module load powertools
You should see:
If you don't time to use the red sticky.
We are now embarking on our foray into the great and powerful HPCC. PAY NO ATTENTION TO THE MAN BEHIND THE CURTAIN!!!
We are logging in to special nodes devoted to the class. In normal circumstances, you would be doing this work on one of our development nodes. More on this later.
Once the login completes
Now, let's look at our data:
You will see the data files:
What we have are 4 sets of RNA seq data from human for 2 conditions.
A quick word on fastq:
Each read has 4 lines. Name, Sequence, Note, and Quality
So, now we get right to work, right?
Quality Assessment and Quality Control
I would recommend one of the following 2 protocols:
Trimmomatic -> FastQC
FastQC -> Trimmomatic -> FastQC
Here's an example of the outputs:
Here is how we do it:
(Before we start, we're going to make small subsets of the data so that we can finish in a reasonable time frame):
head -n 400000 data/ERR315325_1.fastq > QA/Salivary_Rep1_r1.fastq
head -n 400000 data/ERR315325_2.fastq > QA/Salivary_Rep1_r2.fastq
We will take 100000 read pairs from one of the 4 samples, and use them for our analysis.
But what did we do?
why are we naming this Salivary_rep1_r?.fastq?
We will come back and try this again on other samples.
Now that we have data... Time for some reproducible research:
1) We name things so that we can remember what we did.
2) When you run things, save the commands VERBATIM in a single file.
3) Try to organize your data in a way that you can find them.
you might want to have the following directories for any given analysis:
module load FastQC
Let's look at the results:
OR you can do the whole thing interactively:
Now let's make it better:
module load FaQCs
FaQCs.pl -p "Salivary_Rep1_r1.fastq Salivary_Rep1_r2.fastq" -q 10 -thread 4 -pre salivary_rep1 -d .
So report looks different.
BUT, there's still another problem:
So, we still need trimmomatic:
module load Trimmomatic
java -jar $TRIM/trimmomatic PE -threads 4 salivary_rep1.1.trimmed.fastq salivary_rep1.2.trimmed.fastq salivary_rep1.trim.adapterRemoved.1.fastq salivary_rep1.trim.adapterRemoved.u1.fastq salivary_rep1.trim.adapterRemoved.2.fastq salivary_rep1.trim.adapterRemoved.u2.fastq ILLUMINACLIP:$TRIM/adapters/TruSeq3-PE-2.fa:2:30:10 MINLEN:60
Now, we're getting somewhere. We have trimmed data. We can align it.
Aligning to References
"Reference" is always wrong.
Alignment is always less than perfect
So, what do we do?
module load TopHat2/2.1.0
cd $HOME/RNAseq-model/tophattophat -p 4 \ -G ~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf \ --transcriptome-index=$HOME/RNAseq-model/transcriptome \ -o tophat_salivary_rep1 \ ~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Sequence/Bowtie2Index/genome \ ~/RNAseq-model/QA/salivary_rep1.trim.adapterRemoved.1.fastq ~/RNAseq-model/QA/salivary_rep1.trim.adapterRemoved.2.fastq
So, what is this doing?
OK, now we should have an output directory:
module load SAMToolssamtools view -c -F 4 tophat_salivary_rep1/accepted_hits.bam
What is a BAM/SAM file format?
How to view a BAM file?
What are all these columns?
For more information on SAM flags and columns: http://genome.sph.umich.edu/wiki/SAM
To find out instantly about SAM flags: https://broadinstitute.github.io/picard/explain-flags.html
module load PySAM/0.6 module load HTSeq/0.6.1htseq-count --format=bam --stranded=no --order=pos \ tophat_salivary_rep1/accepted_hits.bam \ ~/RNAseq-model/Homo_sapiens/Ensembl/GRCh37/Annotation/Genes/genes.gtf > salivary_repl1_counts.txt
Now we have data!!!
What do we do with it?
WE MAKE MORE DATA:
We can repeat all of the steps above, or we can make a script to do it for us. We'll do "both"
THis will open up a text editing window. paste the following:
And... ctrl-X to close and save
And NOW, can we do it again for the other samples?
Analyze Results using EdgeR
There are various expression analysis packages available in the community to name few: "DESeq, edgeR, BaySeq, Cuffdiff, DEGSeq,.."
What is EdgeR?
EdgeR is an R package built around analyzing first micro-array data, then updated for use with NGS/RNAseq packages. It uses fancy statistical methods to perform analysis.
1) Don't use edgeR without sufficient statistical power
2) Don't use edgeR with low coverage
What can statistical analysis show you?
Differential Gene expression. But what does that MEAN?
|Gene Name||Counts (Condition 1)||Counts (Condition 2)|
Given differential gene expression calculation, what can we say about these samples?
A) Gene A is upregulated in Condition1 ?
B) Gene B is downregulated?
(hint, how many reads ARE there in each sample?)
This leads to the need for sophisticated statistical packages to do analysis.
Things that Differential gene expression (DGE) analysis may take into consideration:
1) Relative counts of reads per sample
2) Std. deviation of read counts within sample
3) Total number of average reads per sample (more reads, higher confidence in actual expression level)
Finally, EdgeR gives us a different statistic than you may be used to (it GIVES P value, but ): FDR
FDR= False Discovery rate. This is a measure of the likelihood that the gene listed as significantly differentially regulated is false.
Let's try to use it.
cd edgeRcurl -O http://2014-msu-rnaseq.readthedocs.org/en/latest/_static/lung_saliva.R curl -O http://2014-msu-rnaseq.readthedocs.org/en/latest/_static/subset/lung_repl1_counts.txt curl -O http://2014-msu-rnaseq.readthedocs.org/en/latest/_static/subset/lung_repl2_counts.txt curl -O http://2014-msu-rnaseq.readthedocs.org/en/latest/_static/subset/salivary_repl1_counts.txt curl -O http://2014-msu-rnaseq.readthedocs.org/en/latest/_static/subset/salivary_repl2_counts.txtcd ~/RNAseq-model/edgeR module load R Rscript lung_saliva.R
This is based on the 100K reads we took earlier (in other words, the input files here should be the same as the ones you made already.
Overnight we will try to generate a full list. In the meantime, let's look at the outputs.
Challenge, set up these same charts using FULL set of data.
RNAseq for Semi-Model Organisms
Sequencing is only half the battle. Model organisms have both finished genomes and good transcriptome annotations. Other genomes are not so lucky.
Let's take a look at chickens.
cd module load powertools getexample RNAseq-semimodel
This will generate a new directory, RNAseq-semimodel which holds the data for an experiment to differentiate between male and female blastocysts.
The problem, of course, is that the Chicken genome is not well annotated. We will do the same thing as before (subsampling)
once you have a new prompt, let's get to work:
ls -l ~/RNAseq-semimodel/data/
(Note the samples.description.txt, but what's inside?):
OK, so now, let's get back to work:
cd RNAseq-semimodelgunzip -c data/SRR534005_1.fastq.gz | head -400000 | gzip > female_repl1_R1.fq.gz gunzip -c data/SRR534005_2.fastq.gz | head -400000 | gzip > female_repl1_R2.fq.gzmodule load FastQCfastqc female_repl1_R1.fq.gz fastqc female_repl1_R2.fq.gz
Let's decide how we feel about the replicates. Then let's trim
module load Trimmomaticjava -jar $TRIM/trimmomatic PE female_repl1_R1.fq.gz female_repl1_R2.fq.gz\ female_repl1_R1.qc.fq.gz s1_se female_repl1_R2.qc.fq.gz s2_se \ ILLUMINACLIP:$TRIM/adapters/TruSeq3-PE.fa:2:40:15 \ LEADING:2 TRAILING:2 SLIDINGWINDOW:4:2 MINLEN:25
We are trimming AND removing adapters with trimmomatic this time.
ILLUMINACLIP: is removing adapters.
now, let's write a script to do all of the samples:
ctrl-x to quit and save
Now, we have trimmed samples to do our FIRST alignment. What are we doing?
To build a NEW transcriptional model, we would want to use ALL available reads, in this case we'll use the subsets, but ALL of them:
module load TopHat2/2.0.12 tophat -p 4 \ -G $HOME/RNAseq-semimodel/reference/Gallus_gallus/UCSC/galGal3/Annotation/Genes/genes.gtf \ --transcriptome-index=$HOME/RNAseq-semimodel/tophat/transcriptome \ -o tophat_all \ $HOME/RNAseq-semimodel/reference/Gallus_gallus/UCSC/galGal3/Sequence/Bowtie2Index/genome \ female_repl1_R1.qc.fq.gz,male_repl1_R1.qc.fq.gz,female_repl2_R1.qc.fq.gz,male_repl2_R1.qc.fq.gz \ female_repl1_R2.qc.fq.gz,male_repl1_R2.qc.fq.gz,female_repl2_R2.qc.fq.gz,male_repl2_R2.qc.fq.gz
Again, Tophat will build a new transcriptome reference, then align all genes. NOW comes the fun part.
let's look at the statistics:
Now that we have outputs, let's use cufflinks to make a new gtf file:
module load cufflinks/2.2.1
cufflinks -o cuff_all tophat_all/accepted_hits.bam
ls -1 cuff_all/transcripts.gtf > cuff_list.txt
cuffmerge -g $HOME/RNAseq-semimodel/reference/Gallus_gallus/UCSC/galGal3/Annotation/Genes/genes.gtf \
-o cuffmerge_all \
-s $HOME/RNAseq-semimodel/reference/Gallus_gallus/UCSC/galGal3/Sequence/WholeGenomeFasta/genome.fa \
But of course, tools aren't perfect, so we have to remove some useless lines:
python remove-nostrand.py cuffmerge_all/merged.gtf > cuffmerge_all/nostrand.gtf
We have now made a NEW GTF file (after editing it to remove stuff that's not compatible with Tophat).
So, what now?
Map reads to reference using NEW gene models
This should look familiar:
tophat -p 4 \ -G cuffmerge_all/nostrand.gtf \ --transcriptome-index=$HOME/RNAseq-semimodel/tophat/merged \ -o tophat_female_repl1 \ ~/RNAseq-semimodel/reference/Gallus_gallus/UCSC/galGal3/Sequence/Bowtie2Index/genome \ female_repl1_R1.qc.fq.gz female_repl1_R2.qc.fq.gz
Note that we're using a new output directory (merged) for the transcriptome index, and that we are using the original reference genome. Why?
And now to count the reads:
module load HTSeq/0.6.1htseq-count --format=bam --stranded=no --order=pos \ tophat_female_repl1/accepted_hits.bam \ cuffmerge_all/nostrand.gtf > female_repl1_counts.txt
And now we have counted them, and they're ready for edgeR.
Analyze sample 2
OK, let's open a new script:
and paste in:
again, easy to paste, this will analyze up to the sequence counts:
OK, now let's edit that file to make sample3.sh and sample4.sh (using male_repl1 and male_repl2 as inputs and outputs)....
bash sample3.sh ; bash sample4.sh
this is using fancy bash to run BOTH programs sequentially. Just be patient.
Once it's done, we can now run the edgeR.