Genome Mapping and Variant Analysis

2015-10-01, Dharanya Sampath


Genome Mapping

1) A reference genome, or even that of a close relative, is available

2) Lots of open-source tools are available to map next-gen sequencing reads

 

So when do we use it?

A) You can get MORE information from your reads when you have a good reference.

B) You can use a SNP database to annotate and filter variants.

C) You can look for GENES that are responsible for a specific trait.

D) Genome-Wide Association Studies (GWAS).

E) You can TARGET a specific part of the genome, such as exome capture.

GOALS

Here we have people from mixed backgrounds working on organisms ranging from bacteria to poorly sequenced plants and animals. This course will try to give you some basic tools and workflows to analyze multiple data types on HPCC-style resources so you can get your analysis moving as quickly as possible. You should leave here with an improved understanding of some of the following:

I) How to use iCER resources to do analysis

II) A set of tools/workflows for variant calling

III) An understanding of where to find resources and databases

 

How Does This Webpage Work?

We will have a few types of text.  Material that is part of the discussion will be in plain text, like this:

Hello.

See what I did there?

We will have text that you can copy and paste into a file:

Expected output will be displayed in "info" boxes:

Received input is: something you typed

 

We will also have commands to copy and paste into a command line.  They will look like this:

 

ls -latrh

 

Easy, right?

 

So, what are we going to do?

Mapping and variant calling workflows

 

The workflow should look something like this:

 

For a more detailed picture of the workflow see here: Variant calling workflow


Model organisms:

Human, Mouse, Rat, Arabidopsis, Rice, Drosophila

Tools and resources well defined.


Semi-Model organisms:

Sequenced, unfinished genomes, maybe missing a SNP database to compare

Extra work required.

 

Where to find your reference genome?

  • Microbes (<10 Mb): download a FASTA file containing the sequence and/or GenBank/EMBL/GFF flat files encapsulating both sequence and features.
  • Macrobes (>100 Mb): download a specific consortium "build" of the reference (e.g. hg19), consisting of FASTA files plus various files used to construct a database of features (PTT, GFF); see the example below.
  • Non-model organisms: build your own via de novo assembly (outside the scope of this course).
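
For example (a hedged illustration only; the course already provides chr20 in the reference/ directory, so you do not need to run this), a single human chromosome can be pulled from the UCSC "goldenPath" download server:

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr20.fa.gz   # hg19 build, chromosome 20 only
gunzip chr20.fa.gz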

Links:

From University of Texas at Austin:

https://wikis.utexas.edu/display/bioiteam/Genome+Variant+Analysis+Course+2014

Other papers:

http://bib.oxfordjournals.org/content/15/2/256

http://genome.cshlp.org/content/20/9/1297

http://onlinelibrary.wiley.com/doi/10.1002/0471250953.bi1110s43/abstract

 

Let's Get to Work:

1) Log on to hpc.msu.edu (ssh -XY username@hpc.msu.edu)

Connecting to the HPCC

2) Type:

module load powertools

getexample SNPanalysis-model

cd SNPanalysis-model

./login.sh

 

Please note: when you log out and log back in to the HPCC, run ./login.sh again to enter the working node. YOU CANNOT RUN ANY JOBS FROM THE GATEWAY NODE.

You should see:


[sampathd@dev-intel10 ~]$ getexample SNPanalysis-model

[sampathd@dev-intel10 sampathd]$ cd SNPanalysis-model/

[sampathd@dev-intel10 SNPanalysis-model]$ ls
data login.sh reference

[sampathd@dev-intel10 SNPanalysis-model]$ ./login.sh
qsub: waiting for job 22491825.mgr-04.i to start

 

If you don't see this, time to use the red sticky.

 

We are logging in to special nodes devoted to the class.  In normal circumstances, you would be doing this work on one of our development nodes.  More on this later.

 

Once the login completes, let's look at our data:

cd SNPanalysis-model

 ls -l

 


[sampathd@css-032 SNPanalysis-model]$ ls -l

total 12
drwxrwxr-x 2 sampathd staff-np 4096 Sep 30 17:41 data
-rwxrwxr-x 1 sampathd staff-np 96 Sep 30 17:41 login.sh
drwxrwxr-x 2 sampathd staff-np 4096 Sep 30 17:41 reference

cd data

ls

You will see the data files:


NA12878_1.fastq.gz NA12878_2.fastq.gz

NA12891_1.fastq.gz NA12891_2.fastq.gz

NA12892_1.fastq.gz NA12892_2.fastq.gz

What we have are six files of paired-end Illumina reads from three human samples (one _1/_2 pair per sample).

 

A quick word on fastq:

Each read takes 4 lines: name, sequence, a '+' separator line (which may optionally repeat the name), and quality.


@ERR315325.1 HISEQ:79:H0CC7ADXX:2:1101:1238:1994/1
CAANTATGAAGTTCCTTGTCTTTGCCTTCATCTTGGCTCTCATGGTTTCCATGATTGGAGCTGATTCATCTGAAGAGAAATTTTTGCGTAGAATTGGAAGA
+
<00#0<<BFFFFF<FFBBBFFBFFFFII<FBBF<FFFBBFBBFFFBFBFFFFF<BBFF<B0BFBFI<B'7BF<B<BFFFFFFFBF<<0'<<BB<<BBBBB#

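Because each read occupies exactly four lines, counting the reads in a (gzipped) FASTQ file is just a line count divided by four. This is a small optional check, not part of the course material:

# count the reads in one of the provided files
echo $(( $(zcat NA12878_1.fastq.gz | wc -l) / 4 ))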

So, now we get right to work, right?

 

Wrong.

Quality Assessment and Quality Control

I would recommend one of the following 2 protocols:

 

Trimmomatic -> FastQC

FastQC -> Trimmomatic -> FastQC

Here's an example of the outputs:

 

Here is how we do it:

(Before we start, we're going to make small subsets of the data so that we can finish in a reasonable time frame):

 

cd ~/SNPanalysis-model

cd data

zcat NA12878_1.fastq.gz | head -n 1000000 > sample_NA12878_1.fastq

zcat NA12878_2.fastq.gz | head -n 1000000 > sample_NA12878_2.fastq

 

We will take 250,000 read pairs (the first 1,000,000 lines of each file) from one human sample and use them for our analysis.


We will come back and try this again on other samples.

 

Now that we have data...  Time for some reproducible research:

1) We name things so that we can remember what we did.

2) When you run things, save the commands VERBATIM in a single file.  

3) Try to organize your data in a way that lets you find it later.

You might want to have the following directories for any given analysis:


data/

bwa/

SAMTools/

GATK/

 

If you don't have them yet, create them:

mkdir bwa GATK SAMTools

 

Let's do a quality check on our data:

cd ~/SNPanalysis-model

cd data

module load FastQC

fastqc sample_NA12878_1.fastq

fastqc sample_NA12878_2.fastq

 

Let's look at the results:

firefox sample_NA12878_1_fastqc.html

OR you can do the whole thing interactively:

fastqc&

 

If you have overrepresented sequences or adapters to trim, use the Trimmomatic tool as in this tutorial: 2015-09-21: RNA-Sequencing Tools. A rough example is sketched below.
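
As a rough sketch only (the $TRIMMOMATIC_JAR path and the adapter file are placeholders/assumptions; check module show Trimmomatic on the HPCC and the tutorial above for the real locations), a paired-end Trimmomatic run looks roughly like this:

# trim adapters and low-quality stretches from the paired sub-sampled files
java -jar $TRIMMOMATIC_JAR PE -phred33 \
     sample_NA12878_1.fastq sample_NA12878_2.fastq \
     trimmed_NA12878_1.fastq unpaired_NA12878_1.fastq \
     trimmed_NA12878_2.fastq unpaired_NA12878_2.fastq \
     ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:36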

 

Now we're getting somewhere. We can align the reads.

Mapping reads to reference genome


Mapping determines a "seed" position where the read shares a subsequence with the reference. But, is this the best match?

Alignment starts with the seed and determines how the read is best aligned on a base-by-base basis around the seed.


Seed→Alignment score→Mapping quality

From reads to alignments

  1. Build an index for your reference genome (specific to the mapping tool used)
  2. Run the mapping tool
  3. Convert SAM -> BAM if the tool doesn't already output BAM
  4. Sort the BAM file
  5. Build a BAM index for the sorted BAM alignment file

 

Pitfalls:

The "reference" is always wrong.

Alignment is always less than perfect.

A large reference may take a longer time to map against.

Shorter reads will take longer than usual, since they can map to many places in the reference genome.

 

Let's try mapping our sub-sampled set:

 

 

To save time, I have already constructed the bwa index for the reference genome. For the course, I am using only human chromosome 20 to speed up the process.

 

I used the following commands to build the index:  module load bwa; bwa index chr20.fa

 

Now that we have a reference index, let's run the bwa mapper, then sort and index the BAM file:

module load SAMTools

module load bwa

cd ~/SNPanalysis-model/bwa


bwa mem -t 4 ~/SNPanalysis-model/reference/chr20.fa \
        ~/SNPanalysis-model/data/sample_NA12878_1.fastq ~/SNPanalysis-model/data/sample_NA12878_2.fastq \
        | samtools view -bS -o sample_NA12878_bwa.bam -

 

samtools sort sample_NA12878_bwa.bam sorted_sample_NA12878_bwa

 

samtools index sorted_sample_NA12878_bwa.bam

 

So, what is this doing?

 

What is a BAM/SAM file format? 

 

SRR030257.264529 99 NC_012967 1521 29 34M2S = 1564 79 CTGGCCATTATCTCGGTGGTAGGACATGGCATGCCC AAAAAA;AA;AAAAAA??A%.;?&'3735',()0*, XT:A:M NM:i:3 SM:i:29 AM:i:29 XM:i:3 XO:i:0 XG:i:0 MD:Z:23T0G4T4

SRR030257.2669090 147 NC_012967 1521 60 36M = 1458 -99 CTGGCCATTATCTCGGTGGTAGGTGATGGTATGCGC <<9:<<AAAAAAAAAAAAAAAAAAAAAAAAAAAAA XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36

 

How to view a BAM file?

What are all these columns?

For more information on SAM flags and columns: http://genome.sph.umich.edu/wiki/SAM

To find out instantly about SAM flags: https://broadinstitute.github.io/picard/explain-flags.html

 

OK, now we should have an output BAM file:

module load SAMTools

samtools view sample_NA12878_bwa.bam | less -S
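
An optional extra check (not part of the original walkthrough): samtools flagstat prints a quick summary of how many reads mapped and how many are properly paired.

samtools flagstat sorted_sample_NA12878_bwa.bam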


Now we have data!!!

What do we do with it?  


Data visualization

 

Can you visualize the BAM file in a way that makes sense? Yes, of course!!

There are tools like IGV, Tablet, etc., developed especially for viewing BAM files.

 

module load IGV

igv.sh

 

 

Sample IGV screenshot: 

 

Prepare BAM file for variant calling

 

Steps to perform:

Before we do variant calling, your BAM file (from the previous step) has to go through these steps:

  1. Add read groups to the sorted BAM file - GATK won't run without read groups
  2. Create a dictionary file for your reference FASTA file - run this only once per reference genome
  3. Remove duplicate alignments
  4. Index the duplicate-removed BAM file



Let's do this then:

 

module load picardTools
module load SAMTools
cd ~/SNPanalysis-model/bwa
java  -Xmx10g -jar $PICARD/AddOrReplaceReadGroups.jar \
                         INPUT=sorted_sample_NA12878_bwa.bam OUTPUT=sorted_sample_NA12878_rg.bam \
                         RGSM=SAMPLE_12878 RGID=SAMPLE_12878 RGLB=TruSeq RGPL=illumina \
                         RGPU=12878 VALIDATION_STRINGENCY=SILENT 
cd ~/SNPanalysis-model/reference
java -Xmx10g -jar $PICARD/CreateSequenceDictionary.jar R=chr20.fa O=chr20.dict
cd ~/SNPanalysis-model/bwa
java -Xmx10g -jar $PICARD/MarkDuplicates.jar \
                      I=sorted_sample_NA12878_rg.bam O=sample_NA12878_rmdup.bam \
                      M=dup_metrics REMOVE_DUPLICATES=true AS=TRUE VALIDATION_STRINGENCY=LENIENT \
                      MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
samtools index sample_NA12878_rmdup.bam 

 

 

Now, we are all set for variant calling!!! 

The two most commonly used SNP callers: GATK and SAMTools mpileup with BCFtools.

GATK is designed to work best with human and mouse data! You are lucky if you work with one of those.

 

Variant calling with SAMTools

 

SNP calling with SAMTools is very user-friendly and quick to run compared to GATK. Refer here for more information: http://samtools.sourceforge.net/mpileup.shtml

Maybe that is the reason the community trusts GATK more?

 

The samtools mpileup command generates coverage/genotype-likelihood data, and BCFtools uses this to call SNPs.

 

Let's try running this now:

module load SAMTools
cd ~/SNPanalysis-model/SAMTools
samtools mpileup -uf ~/SNPanalysis-model/reference/chr20.fa \
                                  ~/SNPanalysis-model/bwa/sample_NA12878_rmdup.bam \
                                   |  bcftools view -bvcg - >NA12878_realigned.bcf
 
bcftools view NA12878_realigned.bcf | vcfutils.pl varFilter -D100 > NA12878_realigned.vcf

 

We end up with a .vcf file. VCF stands for Variant Call Format.

Let's open the file and see what columns we have. (A quick way is shown below.)
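
One quick, optional way to do that from the command line:

less -S NA12878_realigned.vcf                 # header lines start with '#'
grep -m 1 "^#CHROM" NA12878_realigned.vcf     # the column header: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ...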

 

Would you like to filter SNPs with quality > 30? Let's give it a go.

grep -v "^#" NA12878_realigned.vcf  | awk '{if ($6>30) print $0}' |wc -l
grep -v "^#" NA12878_realigned.vcf  | awk '{if ($6>30) print $0}' > filtered_SNPs_quality30.vcf

How many SNPs do you see without filtering? How many do you lose when you filter? (See the check below.)
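
To answer those questions, you can simply count the records before and after filtering (a small optional check):

grep -vc "^#" NA12878_realigned.vcf        # total variant records, unfiltered
wc -l filtered_SNPs_quality30.vcf          # records that passed the QUAL > 30 filter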

 

Variant calling with GATK

 

GATK has a series of steps for variant calling. It may seem a bit tedious, but once you have all these commands in a script, you can just launch it and forget about it (make sure you provide enough walltime and memory for your jobs, as GATK is a memory-intensive program).

 

The GATK team has informative presentations that are very helpful: https://www.broadinstitute.org/gatk/guide/presentations

Also, look here: https://wikis.utexas.edu/pages/viewpage.action?pageId=67799577

 

GATK workflow

 

 

We will try to cover the major areas of the workflow today. We will run it step by step so we don't get lost.

 

module load GATK

cd ~/SNPanalysis-model/GATK


java -Xmx10g -cp $GATK -jar $GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator \
                                        -R  ~/SNPanalysis-model/reference/chr20.fa \
                                        -I  ~/SNPanalysis-model/bwa/sample_NA12878_rmdup.bam \
                                        -o sample_NA12878_rmdup.bam.realign.intervals

java -Xmx10g -cp $GATK -jar $GATK/GenomeAnalysisTK.jar -T IndelRealigner \
                                        -R  ~/SNPanalysis-model/reference/chr20.fa \
                                         -I  ~/SNPanalysis-model/bwa/sample_NA12878_rmdup.bam \
                                       --filter_mismatching_base_and_quals -targetIntervals sample_NA12878_rmdup.bam.realign.intervals \
                                       -o sample_NA12878_realigned.bam

java -Xmx10g -cp $GATK -jar $GATK/GenomeAnalysisTK.jar -T BaseRecalibrator \
                                        -R ~/SNPanalysis-model/reference/chr20.fa -I sample_NA12878_realigned.bam \
                                        -knownSites ~/SNPanalysis-model/reference/hg19_dbsnp.vcf \
                                        -o sample_NA12878_recal.grp

java -Xmx10g -cp $GATK -jar $GATK/GenomeAnalysisTK.jar -T PrintReads \
                                          -R ../reference/chr20.fa -I sample_NA12878_realigned.bam \
                                         -BQSR sample_NA12878_recal.grp -o sample_NA12878_recal.bam

 

Here we have a BAM file that has been processed for indel realignment and base quality recalibration. Let's go for SNP calling to obtain VCF files.

 

Calling SNPs with GATK

 

There are two methods to call SNPs in GATK: UnifiedGenotyper, which is the most widely used and is part of the free version, and HaplotypeCaller, which is for diploid samples only, is in the academic version only, and uses local de novo assembly.
We will get to see how both work.

Once you have a VCF file from either of the callers above, we have to do Variant Quality Score Recalibration (VQSR) to get a clean set of SNPs.

cd ~/SNPanalysis-model/GATK

module load GATK

 

java -Xmx10g -cp $GATK -jar $GATK/GenomeAnalysisTK.jar -T UnifiedGenotyper \
                                        -R ~/SNPanalysis-model/reference/chr20.fa -I sample_NA12878_recal.bam \
                                        -o snps_NA12878_uniGeno.vcf

 

java -Xmx10g -cp $GATK -jar $GATK/GenomeAnalysisTK.jar -T HaplotypeCaller \
                                        -R ~/SNPanalysis-model/reference/chr20.fa -I sample_NA12878_recal.bam \
                                        -o haplotype_NA12878_raw.vcf

 

Some VCF fields differ between SAMTools and GATK; watch out when filtering SNPs!!

 

Variant quality score recalibration

 

These parameter values reflect my own understanding and are adapted, mostly with defaults, to my project; please go through the GATK manual. Change the mode to INDEL here if you want to do indel calling. I have used SNP mode here for SNP calling.

cd ~/SNPanalysis-model/GATK

module load GATK


java -Xmx10g -cp $GATK -jar $GATK/GenomeAnalysisTK.jar -T VariantRecalibrator \
                                         -R ~/SNPanalysis-model/reference/chr20.fa -input snps_NA12878_uniGeno.vcf \
                                        --maxGaussians 4 -resource:known=true,training=true,truth=true,prior=8.0 \
                                         ~/SNPanalysis-model/reference/hg19_dbsnp.vcf -an QD -an MQRankSum -an HaplotypeScore -an FS \
                                        -an MQ -mode SNP -recalFile uniGeno_NA12878.recal -tranchesFile uniGeno_NA12878.tranches -rscriptFile uniGeno_NA12878.plots.R

java -Xmx10g -cp $GATK -jar $GATK/GenomeAnalysisTK.jar -T ApplyRecalibration \
                                         -R ~/SNPanalysis-model/reference/chr20.fa -input snps_NA12878_uniGeno.vcf \
                                         -mode SNP -recalFile uniGeno_NA12878.recal -tranchesFile uniGeno_NA12878.tranches \
                                        -o recal_uniGeno_NA12878.SNPs.vcf -ts_filter_level 99.0
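
A quick, optional check of the VQSR result: variants that fall inside the 99.0 truth-sensitivity tranche are labeled PASS in the FILTER column (column 7), so you can count them like this:

grep -v "^#" recal_uniGeno_NA12878.SNPs.vcf | awk '$7 == "PASS"' | wc -l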

 

 

What's next?? 

We have VCF files but no idea yet about their biological significance. Use annotation tools: Variant Effect Predictor (VEP), SNPEff, and ANNOVAR are a few of the commonly used tools.

 

VEP: http://www.ensembl.org/info/docs/tools/vep/index.html

SNPEff: http://snpeff.sourceforge.net/
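
As a rough sketch of annotation with SNPEff (the snpEff.jar location and the GRCh37.75 database name are assumptions; run java -jar snpEff.jar databases to see what is actually available on your system):

# annotate the recalibrated SNP calls against a matching human database
java -Xmx4g -jar snpEff.jar GRCh37.75 recal_uniGeno_NA12878.SNPs.vcf > annotated_NA12878.vcf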

 

 

We can repeat all of the steps above by hand, or we can make a script to do it for us. We'll do "both"; a skeleton submission script is sketched below.
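
A minimal sketch of what such a submission script could look like, assuming the Torque/PBS scheduler that the qsub output above suggests (the resource requests are assumptions; adjust them to your data, and fill in the second half of the pipeline yourself):

#!/bin/bash
#PBS -N SNPanalysis
#PBS -l walltime=04:00:00
#PBS -l nodes=1:ppn=4
#PBS -l mem=12gb

cd ~/SNPanalysis-model
module load bwa SAMTools picardTools GATK

# 1) map, convert, sort, and index (same commands as in the sections above)
bwa mem -t 4 reference/chr20.fa data/sample_NA12878_1.fastq data/sample_NA12878_2.fastq \
    | samtools view -bS -o bwa/sample_NA12878_bwa.bam -
samtools sort bwa/sample_NA12878_bwa.bam bwa/sorted_sample_NA12878_bwa
samtools index bwa/sorted_sample_NA12878_bwa.bam

# 2) add the Picard, GATK, and filtering commands from the sections above,
#    in the same order they were run interactively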
