
  • 2015-10-01: Genome Mapping and Variant Analysis, Dharanya Sampath

Genome Mapping and Variant Analysis


Genome Mapping

1) A reference genome, or even that of a close relative, is available

2) Lots of open-source tools are available to map next-gen sequencing reads


So when do we use it?

A) You can get MORE information from your reads when you have a good reference.

B) You can use a SNP database to annotate and filter variants.

C) You can look for GENES that are responsible for a specific trait.

D) Genome-Wide Association Studies (GWAS)

E) TARGET a specific part of the genome, such as exome capture.


Here we have people from mixed backgrounds working on organisms ranging from bacteria to poorly sequenced plants and animals. This course will try to give you some basic tools and workflows to analyze multiple data types on HPCC-style resources, to get your analysis moving as quickly as possible.  You should leave here with an improved understanding of some of the following:

I) How to use iCER resources to do analysis

II) A set of tools/workflows for variant calling

III) An understanding of where to find resources and databases


How Does This Webpage Work?

We will have three types of text.  Material that is part of a discussion will be in plain text, like this:


See what I did there?

We will have text that you can copy and paste into a file:

Expected output will be displayed in "info" boxes:

Received input is: something you typed


We will also have commands to copy and paste into a command line.  They will look like this:


ls -latrh


Easy, right?


So, what are we going to do?

Mapping and variant calling workflows


The workflow should look something like this:


For a more detailed picture of the workflow see here: Variant calling workflow

Model organisms:

Human, Mouse, Rat, Arabidopsis, Rice, Drosophila

Tools and resources well defined.

Semi-Model organisms:

Sequenced, unfinished genomes, maybe missing a SNP database to compare

Extra work required.


Where to find your reference genome?

  • Microbes (<10 Mb): download a FASTA file containing the sequence, and/or GenBank/EMBL/GFF flat files encapsulating both sequence and features.
  • Macrobes (>100 Mb): download a specific consortium "build" of the reference (e.g., hg19), consisting of FASTA plus various files used to construct a database of features (PTT, GFF).
  • Non-model organisms: build your own? de novo assembly (outside the scope of this course)


From University of Texas at Austin:

Other papers:


Let's Get to Work:

1) log on to (ssh -XY

Connecting to the HPCC

2) Type:

module load powertools

getexample SNPanalysis-model

cd SNPanalysis-model



Please note: when you log out and log back in to the HPCC, run ./ again to enter the working node. YOU CANNOT RUN ANY JOBS FROM THE GATEWAY NODE

You should see:


[sampathd@dev-intel10 ~]$ getexample SNPanalysis-model

[sampathd@dev-intel10 sampathd]$ cd SNPanalysis-model/

[sampathd@dev-intel10 SNPanalysis-model]$ ls
data reference

[sampathd@dev-intel10 SNPanalysis-model]$ ./
qsub: waiting for job 22491825.mgr-04.i to start


If you don't see this, time to use the red sticky.


We are logging in to special nodes devoted to the class.  In normal circumstances, you would be doing this work on one of our development nodes.  More on this later.


Once the login completes, let's look at our data:

cd SNPanalysis-model

ls -l



[sampathd@css-032 SNPanalysis-model]$ ls -l

total 12
drwxrwxr-x 2 sampathd staff-np 4096 Sep 30 17:41 data
-rwxrwxr-x 1 sampathd staff-np 96 Sep 30 17:41
drwxrwxr-x 2 sampathd staff-np 4096 Sep 30 17:41 reference

cd data


You will see the data files:


NA12878_1.fastq.gz NA12878_2.fastq.gz

NA12891_1.fastq.gz NA12891_2.fastq.gz

NA12892_1.fastq.gz NA12892_2.fastq.gz

What we have are three sets of paired-end Illumina reads (six files, two per sample) from humans


A quick word on fastq:

Each read occupies four lines: a name line (starting with @), the sequence, a separator line (starting with +), and the per-base quality string


@ERR315325.1 HISEQ:79:H0CC7ADXX:2:1101:1238:1994/1
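To make that concrete, here is a tiny FASTQ file (the read names and sequences below are invented for illustration) showing the four-line record structure, plus a quick way to count reads:

```shell
# Write a two-read demo FASTQ (invented data) to show the record layout:
# line 1: @name, line 2: sequence, line 3: "+" separator, line 4: quality.
cat > demo.fastq <<'EOF'
@ERR000001.1 demo/1
GATTACAGATTACA
+
IIIIIIIIIIIIII
@ERR000001.2 demo/1
ACGTACGTACGTAC
+
IIIIIIIIIIIIII
EOF

head -n 4 demo.fastq                    # show the first complete record
echo $(( $(wc -l < demo.fastq) / 4 ))   # read count = total lines / 4, here 2
```

This lines-divided-by-four trick is handy for sanity-checking any FASTQ file.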

So, now we get right to work, right?



Quality Assessment and Quality Control

I would recommend one of the following two protocols:


Trimmomatic -> FastQC

FastQC -> Trimmomatic -> FastQC

Here's an example of the outputs:


Here is how we do it:

(Before we start, we're going to make small subsets of the data so that we can finish in a reasonable time frame):


cd ~/SNPanalysis-model

cd data

zcat NA12878_1.fastq.gz | head -n 1000000 > sample_NA12878_1.fastq

zcat NA12878_2.fastq.gz | head -n 1000000 > sample_NA12878_2.fastq


We will take 250,000 read pairs (1,000,000 lines from each file, at four lines per read) from one human sample, and use them for our analysis.

We will come back and try this again on other samples.
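A note on why head works here: a FASTQ record is exactly four lines, so the line count passed to head must be a multiple of four, and both mate files must be cut at the same number of lines so pairs stay in sync. A tiny invented example:

```shell
# Three invented reads; head -n 8 keeps exactly 2 complete records.
cat > tiny.fastq <<'EOF'
@r1/1
ACGT
+
IIII
@r2/1
TTTT
+
IIII
@r3/1
GGGG
+
IIII
EOF
head -n 8 tiny.fastq > sub.fastq        # 8 lines = 2 complete reads
echo $(( $(wc -l < sub.fastq) / 4 ))    # 2
```

If you run the same head command with the same -n on both _1 and _2 files, the surviving reads remain properly paired.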


Now that we have data...  Time for some reproducible research:

1) We name things so that we can remember what we did.

2) When you run things, save the commands VERBATIM in a single file.  

3) Try to organize your data in a way that you can find them.

You might want to have the following directories for any given analysis: bwa, GATK, and SAMTools. If they don't exist yet, create them:


mkdir bwa GATK SAMTools


Let's do quality check on our data:

cd ~/SNPanalysis-model

cd data

module load FastQC

fastqc sample_NA12878_1.fastq

fastqc sample_NA12878_2.fastq


Let's look at the results:

firefox sample_NA12878_1_fastqc.html

OR you can do the whole thing interactively:



If you have overrepresented sequences or adapters to trim, use the Trimmomatic tool as shown in this tutorial: 2015-09-21: RNA-Sequencing Tools


Now, we're getting somewhere. We can align it.

Mapping reads to reference genome

Mapping determines a "seed" position where the read shares a subsequence with the reference. But, is this the best match?

Alignment starts with the seed and determines how the read is best aligned on a base-by-base basis around the seed.

Seed→Alignment score→Mapping quality
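As a rough sketch of how an alignment score arises: by default, bwa mem scores +1 per matching base and -4 per mismatch (gap penalties omitted here for simplicity); the match and mismatch counts below are invented for illustration.

```shell
# Invented example: a 34-base aligned read with 3 mismatches.
# bwa mem defaults: match +1 (-A 1), mismatch -4 (-B 4).
matches=31
mismatches=3
echo $(( matches * 1 - mismatches * 4 ))   # alignment score: 19
```

The mapping quality then reflects how much better this score is than the score at the next-best location in the reference.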

From reads to alignments

  1. Build an index for your reference genome (specific to the mapping tool used)
  2. Run the mapping tool
  3. Convert SAM -> BAM if the tool doesn't already produce BAM
  4. Sort the BAM file
  5. Build a BAM index for the sorted BAM alignment file



The "reference" is always wrong.

Alignment is always less than perfect.

A large reference may take a longer time to map against.

Shorter reads will take longer than usual, as they may map to many places in the reference genome.


Let's try mapping our sub-sampled set:



To save time, I have already constructed the bwa index for the reference genome. For the course, I am using human chromosome 20 to speed up the process.


I used the following command to build the index:  module load bwa; bwa index chr20.fa


Now we have a reference index, so let's run the bwa mapper, then sort and index the resulting BAM file

module load SAMTools

module load bwa

cd ~/SNPanalysis-model/bwa

bwa mem -t 4 ~/SNPanalysis-model/reference/chr20.fa \
             ~/SNPanalysis-model/data/sample_NA12878_1.fastq \
             ~/SNPanalysis-model/data/sample_NA12878_2.fastq \
             | samtools view -bS - -o sample_NA12878_bwa.bam


samtools sort sample_NA12878_bwa.bam sorted_sample_NA12878_bwa


samtools index sorted_sample_NA12878_bwa.bam


So, what is this doing?


What is a BAM/SAM file format? 



SRR030257.264529 99 NC_012967 1521 29 34M2S = 1564 79 CTGGCCATTATCTCGGTGGTAGGACATGGCATGCCC AAAAAA;AA;AAAAAA??A%.;?&'3735',()0*, XT:A:M NM:i:3 SM:i:29 AM:i:29 XM:i:3 XO:i:0 XG:i:0 MD:Z:23T0G4T4

SRR030257.2669090 147 NC_012967 1521 60 36M = 1458 -99 CTGGCCATTATCTCGGTGGTAGGTGATGGTATGCGC <<9:<<AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA XT:A:U NM:i:0 SM:i:37 AM:i:37 X0:i:1 X1:i:0 XM:i:0 XO:i:0 XG:i:0 MD:Z:36


How to view a BAM file?

What are all these columns?

For more information on SAM flags and columns:

To find out instantly about SAM flags:
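You can also decode a SAM flag by hand with shell arithmetic. The bit meanings below come from the SAM specification; 99 is the flag of the first example record above.

```shell
# Decode SAM flag 99 by testing its bits (99 = 1 + 2 + 32 + 64).
flag=99
if [ $(( flag & 1 ))  -ne 0 ]; then echo "read paired";         fi
if [ $(( flag & 2 ))  -ne 0 ]; then echo "proper pair";         fi
if [ $(( flag & 4 ))  -ne 0 ]; then echo "read unmapped";       fi
if [ $(( flag & 16 )) -ne 0 ]; then echo "read reverse strand"; fi
if [ $(( flag & 32 )) -ne 0 ]; then echo "mate reverse strand"; fi
if [ $(( flag & 64 )) -ne 0 ]; then echo "first in pair";       fi
```

For flag 99 this prints: read paired, proper pair, mate reverse strand, first in pair.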


OK, now we should have an output BAM file:

module load SAMTools

samtools view sample_NA12878_bwa.bam | less -S

Now we have data!!!

What do we do with it?  

Data visualization


Can you visualize the BAM file in a way that makes sense? Yes, of course!!

There are tools like IGV, Tablet, etc., developed especially for viewing BAM files


module load IGV



Sample IGV screenshot: 


Prepare BAM file for variant calling


Steps to perform:

Before we do variant calling, your BAM file (from previous step) has to go through these steps:

  1. Add read groups to the sorted BAM file  - GATK won't run without read groups
  2. Create a dictionary file for your reference fasta file - run this only once for the reference genome
  3. Remove duplicate alignments
  4. Index the duplicate removed BAM file

Let's do this then:


module load picardTools
module load SAMTools
cd ~/SNPanalysis-model/bwa
java -Xmx10g -jar $PICARD/AddOrReplaceReadGroups.jar \
                         INPUT=sorted_sample_NA12878_bwa.bam OUTPUT=sorted_sample_NA12878_rg.bam \
                         RGSM=SAMPLE_12878 RGID=SAMPLE_12878 RGLB=TruSeq RGPL=illumina \
                         RGPU=12878 VALIDATION_STRINGENCY=SILENT
cd ~/SNPanalysis-model/reference
java -Xmx10g -jar $PICARD/CreateSequenceDictionary.jar R=chr20.fa O=chr20.dict
cd ~/SNPanalysis-model/bwa
java -Xmx10g -jar $PICARD/MarkDuplicates.jar \
                      I=sorted_sample_NA12878_rg.bam O=sample_NA12878_rmdup.bam \
                      METRICS_FILE=sample_NA12878_rmdup.metrics REMOVE_DUPLICATES=true   # MarkDuplicates requires a metrics file; the name is arbitrary
samtools index sample_NA12878_rmdup.bam



Now, we are all set for variant calling!!! 

The two most commonly used SNP callers are GATK and SAMTools mpileup with BCFtools.

GATK is designed to work best with human and mouse data! You are lucky if you have one of those.


Variant calling with SAMTools


SNP calling with SAMTools is very user-friendly and quick to run compared to GATK. Refer here for more information:

Is this perhaps the reason the community trusts GATK more?


The SAMTools mpileup command will generate coverage data, and BCFtools will use this to find SNPs.


Let's try running this now:

module load SAMTools
cd ~/SNPanalysis-model/SAMTools
samtools mpileup -uf ~/SNPanalysis-model/reference/chr20.fa \
                                  ~/SNPanalysis-model/bwa/sample_NA12878_rmdup.bam \
                                   | bcftools view -bvcg - > NA12878_realigned.bcf
bcftools view NA12878_realigned.bcf | vcfutils.pl varFilter -D100 > NA12878_realigned.vcf


We end up with a .vcf file. VCF stands for Variant Call Format.

Let's open the file and see what columns we have here.


Would you like to filter SNPs with quality > 30? Let's give it a go.

grep -v "^#" NA12878_realigned.vcf  | awk '{if ($6>30) print $0}' |wc -l
grep -v "^#" NA12878_realigned.vcf  | awk '{if ($6>30) print $0}' > filtered_SNPs_quality30.vcf

How many SNPs do you see without filtering? How many do you lose when you filter?
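As a self-contained illustration (the two VCF records below are made up), the QUAL value sits in column 6, so the awk filter works like this:

```shell
# Build a tiny made-up VCF; body columns are CHROM POS ID REF ALT QUAL FILTER INFO.
printf '%s\n' \
  '##fileformat=VCFv4.1' \
  '#CHROM POS ID REF ALT QUAL FILTER INFO' \
  'chr20 60479 . C T 45.0 . DP=10' \
  'chr20 60522 . T A 12.3 . DP=4' \
  | tr ' ' '\t' > mini.vcf

# Drop header lines, keep rows with QUAL > 30, count survivors.
grep -v "^#" mini.vcf | awk '$6 > 30' | wc -l   # 1
```

Here one of the two records passes the QUAL > 30 cut; on real data, compare the before and after counts to see how aggressive your threshold is.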


Variant calling with GATK


GATK has a series of steps for variant calling. It may seem a bit tedious, but once you have all these commands in a script you can just launch it and forget about it (make sure you request enough walltime and memory for your jobs, as GATK is a memory-intensive program).


They have very informative presentations that are very helpful:

Also, look here:


GATK workflow



We will try to cover the major areas of the workflow today. We will run it step by step so we don't get lost.


module load GATK

cd ~/SNPanalysis-model/GATK

java -Xmx10g -cp $GATK -jar $GATK/GenomeAnalysisTK.jar -T RealignerTargetCreator \
                                        -R  ~/SNPanalysis-model/reference/chr20.fa \
                                        -I  ~/SNPanalysis-model/bwa/sample_NA12878_rmdup.bam \
                                        -o sample_NA12878_rmdup.bam.realign.intervals

java -Xmx10g -cp $GATK -jar $GATK/GenomeAnalysisTK.jar -T IndelRealigner \
                                        -R  ~/SNPanalysis-model/reference/chr20.fa \
                                         -I  ~/SNPanalysis-model/bwa/sample_NA12878_rmdup.bam \
                                       --filter_mismatching_base_and_quals -targetIntervals sample_NA12878_rmdup.bam.realign.intervals \
                                       -o sample_NA12878_realigned.bam

java -Xmx10g -cp $GATK -jar $GATK/GenomeAnalysisTK.jar -T BaseRecalibrator \
                                        -R ~/SNPanalysis-model/reference/chr20.fa -I sample_NA12878_realigned.bam \
                                        -knownSites ~/SNPanalysis-model/reference/hg19_dbsnp.vcf \
                                        -o sample_NA12878_recal.grp

java -Xmx10g -cp $GATK -jar $GATK/GenomeAnalysisTK.jar -T PrintReads \
                                          -R ../reference/chr20.fa -I sample_NA12878_realigned.bam \
                                         -BQSR sample_NA12878_recal.grp -o sample_NA12878_recal.bam


Here we have a BAM file that has been processed for indel realignment and base quality recalibration. Let's go for SNP calling to obtain VCF files.


Calling SNPs with GATK


There are two methods to call SNPs in GATK: UnifiedGenotyper, which is the most widely used and is included in the free version, and HaplotypeCaller, which is diploid-only, available only in the academic version, and uses local de novo assembly.
We can see how both work.
Once you have a VCF file from either of the callers above, we have to do Variant Quality Score Recalibration (VQSR) to get a set of clean SNPs.

cd ~/SNPanalysis-model/GATK

module load GATK


java -Xmx10g -cp $GATK -jar $GATK/GenomeAnalysisTK.jar -T UnifiedGenotyper \
                                        -R ~/SNPanalysis-model/reference/chr20.fa -I sample_NA12878_recal.bam \
                                        -o snps_NA12878_uniGeno.vcf


java -Xmx10g -cp $GATK -jar $GATK/GenomeAnalysisTK.jar -T HaplotypeCaller \
                                        -R ~/SNPanalysis-model/reference/chr20.fa -I sample_NA12878_recal.bam \
                                        -o haplotype_NA12878_raw.vcf


Some VCF file fields are different between SAMTools and GATK; watch out when filtering SNPs!!


Variant quality score recalibration


These parameter values are based on my current understanding and were adapted from the defaults for my project; please go through the GATK manual. Change the mode to INDEL here if you want to do indel calling; I have used SNP mode here for SNP calling.

cd ~/SNPanalysis-model/GATK

module load GATK

java -Xmx10g -cp $GATK -jar $GATK/GenomeAnalysisTK.jar -T VariantRecalibrator \
                                         -R ~/SNPanalysis-model/reference/chr20.fa -input snps_NA12878_uniGeno.vcf \
                                        --maxGaussians 4 -resource:dbsnp,known=true,training=true,truth=true,prior=8.0 \
                                         ~/SNPanalysis-model/reference/hg19_dbsnp.vcf -an QD -an MQRankSum -an HaplotypeScore -an FS \
                                        -an MQ -mode SNP -recalFile uniGeno_NA12878.recal -tranchesFile uniGeno_NA12878.tranches -rscriptFile uniGeno_NA12878.plots.R

java -Xmx10g -cp $GATK -jar $GATK/GenomeAnalysisTK.jar -T ApplyRecalibration \
                                         -R ~/SNPanalysis-model/reference/chr20.fa -input snps_NA12878_uniGeno.vcf \
                                         -mode SNP -recalFile uniGeno_NA12878.recal -tranchesFile uniGeno_NA12878.tranches \
                                        -o recal_uniGeno_NA12878.SNPs.vcf -ts_filter_level 99.0



What's next??

We have VCF files, but no idea about their biological significance. Use annotation tools: Variant Effect Predictor (VEP), SnpEff, and ANNOVAR are a few of the commonly used tools.






We can repeat all of the steps above, or we can make a script to do it for us.  We'll do "both"
