*This archived content is for the previous version of the HPC operating system. Please see the 2018 [Bioinformatics] page for the latest information on currently supported software.*

This tutorial has been modified slightly from one originally provided as part of the ANGUS bioinformatics course. Modifications have been made to accommodate following the tutorial on the HPCC instead of Amazon EC2.

Checking read quality with FastQC

When you get your sequences back from a sequencing facility, it’s important to check that they are high quality (garbage in, garbage out). In this tutorial, we’ll use software called FastQC, which checks whether a set of sequence reads in a .fastq file exhibits any unusual qualities (which might indicate either low sequence quality or interesting biological features in your sample).

Getting the data

The data used in this tutorial have already been selected and downloaded for your convenience. They are located in an HPCC directory; the full path appears in the copy commands below.

Simply copy the following files over to your working directory.  First, a "good" sequence in fastq format:

cp /mnt/research/common-data/Bio/AngusData/good_sequence_short.fastq .

Then a "bad" one:

cp /mnt/research/common-data/Bio/AngusData/bad_sequence_short.fastq .
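Before moving on, it can be worth confirming that both files copied completely. The sketch below is not part of the original tutorial: `count_fastq_reads` is a hypothetical helper, and it assumes an uncompressed FASTQ file with the usual four lines per read.

```shell
# Hypothetical helper (not from the original tutorial): count the reads in an
# uncompressed FASTQ file, assuming exactly four lines per record.
count_fastq_reads() {
    local file="$1"
    local lines
    lines=$(wc -l < "$file")
    # A line count that is not a multiple of 4 suggests a truncated copy.
    if [ $((lines % 4)) -ne 0 ]; then
        echo "warning: $file has $lines lines, not a multiple of 4" >&2
    fi
    echo $((lines / 4))
}
```

For example, `count_fastq_reads good_sequence_short.fastq` prints the number of reads in the "good" file; a warning on standard error suggests the file may be incomplete.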

Running FastQC

To run FastQC on the HPCC in interactive mode, you will need to establish an X-connection over SSH.  On workstations using the Mac or Linux operating system, simply open a terminal and enter:

ssh -X someUser@hpcc.msu.edu

For Windows users, you will need PuTTY and Xming or Cygwin-X to establish an X-connection over SSH.  You can follow these instructions for Xming, or stop by the HPCC and pick up a preloaded thumb drive with the software you need.

Once you are connected to Gateway with an X-session, you will need to log in to one of the dev-nodes before running FastQC:

ssh dev-amd09

Now, simply load the module file for FastQC (remember to do this on a dev-node):

module load FastQC
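A quick way to confirm that the module load worked is to check that the `fastqc` command is now on your PATH. The sketch below uses a hypothetical helper, `require_command`, which is not part of the original tutorial; it relies only on the standard shell built-in `command -v`.

```shell
# Hypothetical helper (not from the original tutorial): report whether a
# command is available on the PATH, returning non-zero if it is not.
require_command() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at $(command -v "$1")"
    else
        echo "$1 not found on PATH; is the module loaded?" >&2
        return 1
    fi
}
```

Running `require_command fastqc` on the dev-node should report a location if the module loaded correctly.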

There are two ways in which FastQC can be run: in "command line" mode, or as a GUI (graphical user interface).  This tutorial addresses the command line version of FastQC.  Let's start by analyzing our "good" file:

fastqc ./good_sequence_short.fastq

This will generate a self-contained directory called "good_sequence_short_fastqc", which contains an HTML-formatted report that can be loaded into a browser. If we change into that directory and display the contents of the file "summary.txt", we can see which tests passed and which failed:

cd good_sequence_short_fastqc
cat summary.txt
PASS	Basic Statistics	good_sequence_short.fastq
PASS	Per base sequence quality	good_sequence_short.fastq
PASS	Per sequence quality scores	good_sequence_short.fastq
WARN	Per base sequence content	good_sequence_short.fastq
PASS	Per base GC content	good_sequence_short.fastq
PASS	Per sequence GC content	good_sequence_short.fastq
PASS	Per base N content	good_sequence_short.fastq
PASS	Sequence Length Distribution	good_sequence_short.fastq
PASS	Sequence Duplication Levels	good_sequence_short.fastq
PASS	Overrepresented sequences	good_sequence_short.fastq
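For a quick overview, the statuses in "summary.txt" can be tallied. The following is a minimal sketch, assuming the tab-separated layout shown above (status, test name, file name); `tally_fastqc_summary` is a hypothetical helper name, not something FastQC provides.

```shell
# Hypothetical helper (not from the original tutorial): count how many tests
# in a FastQC summary.txt ended in each status (PASS, WARN, FAIL).
# Assumes the status is the first tab-separated field on every line.
tally_fastqc_summary() {
    cut -f1 "$1" | sort | uniq -c | awk '{print $2 "=" $1}'
}
```

Run from inside the report directory as `tally_fastqc_summary summary.txt`. For the listing above, this would print `PASS=9` and `WARN=1`.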

If we were to open the file "fastqc_report.html" in a browser, we would see:

The image above presents only a small portion of the output you receive from FastQC and is provided for demonstration purposes only. Please scroll down through your FastQC results to see other useful charts and tables, or click on the links in the left-hand pane.

Now we can return to the working directory (cd ..) and repeat this procedure using our file of "bad" sequences:

fastqc ./bad_sequence_short.fastq

Which produces:

cd bad_sequence_short_fastqc
cat summary.txt
PASS	Basic Statistics	bad_sequence_short.fastq
FAIL	Per base sequence quality	bad_sequence_short.fastq
PASS	Per sequence quality scores	bad_sequence_short.fastq
WARN	Per base sequence content	bad_sequence_short.fastq
WARN	Per base GC content	bad_sequence_short.fastq
WARN	Per sequence GC content	bad_sequence_short.fastq
PASS	Per base N content	bad_sequence_short.fastq
PASS	Sequence Length Distribution	bad_sequence_short.fastq
WARN	Sequence Duplication Levels	bad_sequence_short.fastq
WARN	Overrepresented sequences	bad_sequence_short.fastq
FAIL	Kmer Content	bad_sequence_short.fastq
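When a report contains several warnings and failures, it can help to list only the tests that did not pass. The sketch below again assumes the tab-separated "summary.txt" layout shown above; `flagged_fastqc_tests` is a hypothetical helper name introduced for illustration.

```shell
# Hypothetical helper (not from the original tutorial): print only the tests
# in a FastQC summary.txt whose status is not PASS, as "STATUS: test name".
flagged_fastqc_tests() {
    awk -F'\t' '$1 != "PASS" {print $1 ": " $2}' "$1"
}
```

Run from inside the report directory as `flagged_fastqc_tests summary.txt`; for the "bad" file, the output would begin with `FAIL: Per base sequence quality`.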

Running FastQC in GUI Mode

If you want to run FastQC in GUI mode, log on to the HPCC using an X-windows session, load the module file, and start FastQC as follows:

module load FastQC
fastqc &

A video has been prepared by the FastQC developers which illustrates how to use this application in GUI mode.