Theoretical and Practical HiC Workshop
Quality control of NGS data
Learning objectives
- Understand the FastQ format.
- Understand the concept of sequence quality.
- Learn how to use software to measure and improve the quality of massive sequencing data.
Downloading the data
The first thing we do when receiving our data is to download them. Our collaborator has provided the link. Once downloaded we will copy them into our docker container using the following command. Remember to replace the id by your docker id:
$ docker cp FastQC_Short.tar.gz a646764e06a0:/usr/local/data/FastQC_Short.tar.gz
They are usually in a compressed format. We can decompress the data using the tar
command:
$ tar -xvf FastQC_Short.tar.gz
This command decompresses a directory called FastQC_Short
. We enter that directory:
$ cd /usr/local/data/FastQC_Short
Revise its contents:
$ ls
Partial_SRR2467141.fastq
Partial_SRR2467142.fastq
Partial_SRR2467143.fastq
Partial_SRR2467144.fastq
Partial_SRR2467145.fastq
Partial_SRR2467146.fastq
Partial_SRR2467147.fastq
Partial_SRR2467148.fastq
Partial_SRR2467149.fastq
Partial_SRR2467150.fastq
Partial_SRR2467151.fastq
Let’s review one of the files using the head command. It usually shows us the first 10 lines of a file but with the -n
flag will show us the first 12:
$ head -n 12 Partial_SRR2467141.fastq
@SRR2467141.1 SALLY:472:C6NCDACXX:8:1101:1392:1873 length=101
NTTATTTTTTTCGTTCTTCTTGAAGAAGACGTTACCTACGGCGTATTTGCCCATCTCAGGTATGTCTAGATCAAGATCTAACTTGAATTCTCTTTTCATAA
+SRR2467141.1 SALLY:472:C6NCDACXX:8:1101:1392:1873 length=101
#1:=BDDDHDHB?<?CFGGGC9@FF@GGGG>EEEDGDHGFHGE;AEFH>AC@D;@B>C>CCC@C>>DDCC3:>AA5>CC>>CCD>@CCDCDCCCCC@C@>C
@SRR2467141.2 SALLY:472:C6NCDACXX:8:1101:1326:1950 length=101
CACCCATTGACTGGCCAAATGCCCCATTATTTTGAGTATTGTTATTTCCAAATAAACTGTTACTATTACTGCCAGCGGCAGAAGTGAATCCACAGATCGGA
+SRR2467141.2 SALLY:472:C6NCDACXX:8:1101:1326:1950 length=101
??@DFFFFHHFH?GEHEHIDEHCCFEHIE@GIGCHG<DGGIHIGIGGIGHIIFIHFFGDHGGIIHIGIIIIGGEH@EADB>=AA>3@>CCCCCCBC@C###
@SRR2467141.3 SALLY:472:C6NCDACXX:8:1101:1477:1959 length=101
CATCTTTTCTTTAGGCACATCATCCTGATAAGTGTACTTACCAGGATATATACCATCGGTATTGATGTTATCGGCATCACATAAAACTAATTCACCAGAAA
+SRR2467141.3 SALLY:472:C6NCDACXX:8:1101:1477:1959 length=101
@CCFFFFFHHHHHJJJIIJJIJJJJJJEIJJJIIJJJIJIJJIJJIIIJJJIIIHIIIJDHIJJIGJJIJJJJJIHHHFFFFFEEEEEEDDEDDDDDDDDD
This file is in fastq
format. This is the format that most modern sequencing platforms generate. This format is a modification of the sequence format Fasta. You can recognize them because they end with the extension .fq
or .fastq
.
FastQ files are plain ASCII text files that contain both the sequences detected by the sequencer as well as information about the quality of each one of the sequenced nucleotides.
We will take a single sequence as an example:
@SRR2467141.1 SALLY:472:C6NCDACXX:8:1101:1392:1873 length=101
NTTATTTTTTTCGTTCTTCTTGAAGAAGACGTTACCTACGGCGTATTTGCCCATCTCAGGTATGTCTAGATCAAGATCTAACTTGAATTCTCTTTTCATAA
+SRR2467141.1 SALLY:472:C6NCDACXX:8:1101:1392:1873 length=101
#1:=BDDDHDHB?<?CFGGGC9@FF@GGGG>EEEDGDHGFHGE;AEFH>AC@D;@B>C>CCC@C>>DDCC3:>AA5>CC>>CCD>@CCDCDCCCCC@C@>C
We see that each sequence is represented by 4 lines.
- Sequence name - This line begins with the symbol
@
, followed by the read name. - The nucleotide sequence.
- Second title - This line starts with the symbol
+
. Generally the information is the same as in the first line but it can also be blank as long as it begins with the symbol+
. - Quality information - It contains an ASCII character string. Every one of these characters corresponds to a nucleotide of the sequence and represents the quality of the same. The score of each nucleotide indicates the level of confidence that you have in that base. High levels indicate that the base has been reported correctly while low levels suggest that there is uncertainty about the actual sequence in this position.
The name of the sequence also contains important information. In In this case, the Illumina data provide the following information:
@<instrument>:<run>:<flow cell identifier>:<lane>:<square>:<x-coord>:<y-coord> <read>:<if the ead was filtered>:<control number>:<sequence index> length=<sequence size>
The level of quality represented in the fourth line is called Phred score. In its original format a Phred score is simply the logarithm of the probabilities of error:
Phred score = - 10 * log10(error probability)
Quality score | Error probability |
---|---|
Q40 0.0001 | (1 en 10,000) |
Q30 0.001 | (1 en 1,000) |
Q20 0.01 | (1 en 100) |
Q10 0.1 | (1 en 10) |
Wait, if the Phred scores are numbers, why do the quality scores shown in line four are a mixture of alphanumeric symbols?
#1:=BDDDHDHB?<?CFGGGC9@FF@GGGG>EEEDGDHGFHGE;AEFH>AC@D;@B>C>CCC@C>>DDCC3:>AA5>CC>>CCD>@CCDCDCCCCC@C@>C
This is because these scores are encoded in ASCII (American Standard Code for Informational Interchange). In short, ASCII is a coding system that equates a number to a character alphanumeric. For example, the character ‘A’ is represented by the number 65 in the table ASCII code, while ‘%’ is represented by the number 37. This system allows us to represent a total of 256 different characters.
Why use ASCII? Given the huge amount of data produced during massive parallel sequencing, it’s all about reducing the data to the maximum. The ASCII system allows us to represent two-digit numbers in a single bit (8 bits), which reduces the space they occupy these files. If you are interested in the subject, you can read more about binary coding.
Finally, since this quality coding system was invented, which is already used with Sanger-type sequencing, different companies have “slipped” the scales of ASCII conversion because what is important to verify with the company or laboratory in which version of this scale have been coded to use the correct conversion. Some of the quality control tools that we will use infer the type of coding used directly from the data provided.
Verifying the quality of the sequences
We will verify the quality of the sequences using the program FastQC. This is one of the most used programs for this type of analysis, since it provides a global vision of how the data is structured, as well as being pretty fast.
FastQC reads a series of sequencing files and produces a report of quality control for each one, which consists of a number of different modules, each of which helps us identify different problems in our data.
Let’s analyze our first sequencing file, first we created a directory to store our results:
$ mkdir QUAL
And we perform our quality analysis using FastQC:
$ fastqc -O ./QUAL/ Partial_SRR2467141.fastq
Started analysis of Partial_SRR2467141.fastq
Approx 20% complete for Partial_SRR2467141.fastq
Approx 40% complete for Partial_SRR2467141.fastq
Approx 60% complete for Partial_SRR2467141.fastq
Approx 80% complete for Partial_SRR2467141.fastq
Analysis complete for Partial_SRR2467141.fastq
Once finished, let’s review the result:
$ cd QUAL
$ ls
Partial_SRR2467141_fastqc.html
Partial_SRR2467141_fastqc.zip
The easiest way to explore these results is by opening the html file in your browser. You can download them to a working directory in your computer using the command:
$ docker cp a646764e06a0:/usr/local/data/FastQC_Short/QUAL/Partial_SRR2467141_fastqc.html .
You can open it by double clicking on the file.
The results should be similar to this.
This page contains a lot of information broken down into the following sections:
- Basic Statistics - The basic statistics of the experiment.
- Per base sequence quality - Box diagrams showing the quality of each base.
- Per tile sequence quality - It contains the box (tile) from which each sequence comes. It only appears if the analyzes are made in an Illumina library that retains its original identifiers.
- Per sequence quality scores - It allows to see if a subgroup of sequences has bad quality.
- Per base sequence content - Shows the proportion of each base in each position.
- Per sequence GC content - Compares the observed GC distribution with a modeled GC distribution.
- Per base N content - Indicates how many nucleotides could not be interpreted and are indicated with an N.
- Sequence Length Distribution - Shows the sequence length distribution.
- Sequence Duplication Levels - Indicate how many times each sequence is repeated. It is done only in the first 100,000 sequences.
- Overrepresented sequences - Shows over represented sequences.
- Adapter Content - Shows the content of adapters in the library.
Let’s review the results once more by clicking on the link.
In general we see that most of the criteria have a green tick of quality control pass, with the exception of per base sequence content, GC content per base and overrepresented sequences.
Although we did not provide this information, the program has inferred the coding ASCII (Sanger / Illumina 1.9) as well as the number of sequences, their size and its GC content.
When we observe the quality per base, we see that most of the bases are in the area green or have good quality. Each base is represented by a box diagram that provides a good idea about the distribution of quality scores in all the sequences. It is evident that the quality decreases more markedly at the end of the sequence. This is known as small errors accumulate while the longer the sequence, this is one of the reasons why sequencers based in nucleotide incorporation they have a limit on the maximum length of their readings.
The graph showing the quality per tile is almost completely blue showing that the average distribution of errors in the tile is very similar.
The graph of quality score per sequence shows that most of the sequences have high scores.
However, the distribution of nucleotides per base shows that, at the beginning of the sequence, the distribution is very disparate.
We also observe that:
- The GC distribution per sequence is somewhat different to the hypothetical distribution.
- There are few ambiguous bases (Ns).
- All sequences have the same size.
- Levels of duplication of sequences are very low.
- There are a few over represented sequences. Mostly Illumina PCR primers.
- There are no adapters.
When running FastQC, a compressed file called Partial_SRR2467141_fastqc.zip
.
We unpack this file using the following command:
$ unzip Partial_SRR2467141_fastqc.zip
Archive: Partial_SRR2467141_fastqc.zip
creating: Partial_SRR2467141_fastqc/
creating: Partial_SRR2467141_fastqc/Icons/
creating: Partial_SRR2467141_fastqc/Images/
inflating: Partial_SRR2467141_fastqc/Icons/fastqc_icon.png
inflating: Partial_SRR2467141_fastqc/Icons/warning.png
inflating: Partial_SRR2467141_fastqc/Icons/error.png
inflating: Partial_SRR2467141_fastqc/Icons/tick.png
inflating: Partial_SRR2467141_fastqc/summary.txt
inflating: Partial_SRR2467141_fastqc/Images/per_base_quality.png
inflating: Partial_SRR2467141_fastqc/Images/per_tile_quality.png
inflating: Partial_SRR2467141_fastqc/Images/per_sequence_quality.png
inflating: Partial_SRR2467141_fastqc/Images/per_base_sequence_content.png
inflating: Partial_SRR2467141_fastqc/Images/per_sequence_gc_content.png
inflating: Partial_SRR2467141_fastqc/Images/per_base_n_content.png
inflating: Partial_SRR2467141_fastqc/Images/sequence_length_distribution.png
inflating: Partial_SRR2467141_fastqc/Images/duplication_levels.png
inflating: Partial_SRR2467141_fastqc/Images/adapter_content.png
inflating: Partial_SRR2467141_fastqc/fastqc_report.html
inflating: Partial_SRR2467141_fastqc/fastqc_data.txt
inflating: Partial_SRR2467141_fastqc/fastqc.fo
If we enter the newly created directory, we can see the report in plain text format, both the summary:
$ cd Partial_SRR2467141_fastqc
$ more summary.txt
PASS Basic Statistics Partial_SRR2467141.fastq
PASS Per base sequence quality Partial_SRR2467141.fastq
PASS Per tile sequence quality Partial_SRR2467141.fastq
PASS Per sequence quality scores Partial_SRR2467141.fastq
FAIL Per base sequence content Partial_SRR2467141.fastq
PASS Per sequence GC content Partial_SRR2467141.fastq
PASS Per base N content Partial_SRR2467141.fastq
PASS Sequence Length Distribution Partial_SRR2467141.fastq
PASS Sequence Duplication Levels Partial_SRR2467141.fastq
PASS Overrepresented sequences Partial_SRR2467141.fastq
PASS Adapter Content Partial_SRR2467141.fastq
PASS Kmer Content Partial_SRR2467141.fastq
and the full report file:
$ more fastqc_data.txt
We have omitted the content since it is very long. Being able to review the results in flat files is extremely useful when results are in a server that does not have a graphical interface.
FastQC programmers have compiled examples of quality sequencing data diverse Let’s analyze them:
- Good data - Illumina
- Bad data - Illumina
- Dimer contamination
- Small RNAs with read through the adaptor
How valid is our analysis?
Since the data we used only less than 200,000 sequences, can we trust that the complete data will have similar quality? Why or why not?
How do the rest of the files look?
Obtain the fastQC report html files for all other three fastq files.
Cleaning the sequences
There are a number of tools to clean sequences and different researchers have different preferences. Similar to other sequencing problems there is an active development of programs for this purpose.
We will use a program called Trimmomatic. This is a Java based program which, like FastQC, is very fast. It also has filtering modes for single or pair-end reads.
The program performs various functions, including:
- Removal of adapters
- Removal of low quality bases at 5 ‘or 3’ of the read
- Scanning of the sequence through a window of 4 nucleotides, cutting when the average quality of the base drops of 15.
- Delete sequences shorter than a particular size
- Cut a specific number of nucleotides at the 5 ‘or 3’ end
- Conversion of quality scores between different versions of Phred.
The specific functions according to the manual are:
- ILLUMINACLIP: Cut adapter and other illumina-specific sequences from the read.
- SLIDINGWINDOW: Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold.
- LEADING: Cut bases off the start of a read, if below a threshold quality
- TRAILING: Cut bases off the end of a read, if below a threshold quality
- CROP: Cut the read to a specified length
- HEADCROP: Cut the specified number of bases from the start of the read
- MINLEN: Drop the read if it is below a specified length
- TOPHRED33: Convert quality scores to Phred-33
- TOPHRED64: Convert quality scores to Phred-64
The configuration changes depending on whether the sequences are single end or paired-end.
You can find more detailed descriptions of their functions in the manual.
Let’s see how it works. The software is already installed on the docker container.
Let’s go back to the folder where our fastq files are.
$ cd ../..
The quality of our test sequences is quite good, but we saw that the quality low at the 3 ‘end, as well as the disproportionate presence of bases at the end 5’. Let’s clean our sequences using Trimmomatic.
We will ask Trimmomatic to cut:
- The Illumina TruSeq single end (SE) adapters from the 3 ’end
- The first 10 nucleotides of the 5’ end
- Bases of poor quality using a window of 4 nucleotides with a minimum average of Phred of 15
- Sequences less than 60 nucleotides.
$ trimmomatic SE Partial_SRR2467141.fastq Trimmed_Partial_SRR2467141.fastq ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 HEADCROP:10 SLIDINGWINDOW:4:15 MINLEN:60
Automatically using 1 threads
java.io.FileNotFoundException: /home/sfernandez/TRANSCRIPTOMICA/FastQC_Short/TruSeq3-SE.fa (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at org.usadellab.trimmomatic.fasta.FastaParser.parse(FastaParser.java:54)
at org.usadellab.trimmomatic.trim.IlluminaClippingTrimmer.loadSequences(IlluminaClippingTrimmer.java:110)
at org.usadellab.trimmomatic.trim.IlluminaClippingTrimmer.makeIlluminaClippingTrimmer(IlluminaClippingTrimmer.java:71)
at org.usadellab.trimmomatic.trim.TrimmerFactory.makeTrimmer(TrimmerFactory.java:32)
at org.usadellab.trimmomatic.Trimmomatic.createTrimmers(Trimmomatic.java:59)
at org.usadellab.trimmomatic.TrimmomaticSE.run(TrimmomaticSE.java:303)
at org.usadellab.trimmomatic.Trimmomatic.main(Trimmomatic.java:85)
Quality encoding detected as phred33
Input Reads: 5000 Surviving: 4856 (97.12%) Dropped: 144 (2.88%)
TrimmomaticSE: Completed successfully
We can see that there were very few sequences that were discarded but still are things that could have affected our analysis and/or assembly. We also see that there is an error:
java.io.FileNotFoundException: /home/sfernandez/CLASS_TEST/FastQC_Short/TruSeq3-SE.fa (No such file or directory)
This is because the sequences of the adapters are not in the same directory as our data. One way to compensate for this is to download the necessary files (these files are provided with trimmomatic but we will download them as we are working on the server)
Download this file and run Trimmomatic again:
$ wget https://raw.githubusercontent.com/timflutre/trimmomatic/master/adapters/TruSeq3-SE.fa
$ trimmomatic SE Partial_SRR2467141.fastq Trimmed_Partial_SRR2467141.fastq ILLUMINACLIP:TruSeq3-SE.fa:2:30:10 HEADCROP:10 SLIDINGWINDOW:4:15 MINLEN:60
Automatically using 1 threads
Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
ILLUMINACLIP: Using 0 prefix pairs, 2 forward/reverse sequences, 0 forward only sequences, 0 reverse only sequences
Quality encoding detected as phred33
Input Reads: 5000 Surviving: 4847 (96.94%) Dropped: 153 (3.06%)
TrimmomaticSE: Completed successfully
This result shows us that the program was executed properly and eliminated sequences which contain the adapters that we indicate. Comparing with our previous results we see that it found 9 sequences with these adapters.
If we use the command ls
we can see that a result file called Trimmed_Partial_SRR2467141.fastq
.
$ ls
Partial_SRR2467141.fastq
Partial_SRR2467142.fastq
Partial_SRR2467143.fastq
Partial_SRR2467144.fastq
Partial_SRR2467145.fastq
Partial_SRR2467146.fastq
Partial_SRR2467147.fastq
Partial_SRR2467148.fastq
Partial_SRR2467149.fastq
Partial_SRR2467150.fastq
Partial_SRR2467151.fastq
QUAL
**Trimmed_Partial_SRR2467141.fastq**
TruSeq3-SE.fa
Let’s now review the results of our sequence cleaning. First we will see the first lines of the clean file.
$ head -n 12 Trimmed_Partial_SRR2467141.fastq
@SRR2467141.1 SALLY:472:C6NCDACXX:8:1101:1392:1873 length=101
TCGTTCTTCTTGAAGAAGACGTTACCTACGGCGTATTTGCCCATCTCAGGTATGTCTAGATCAAGATCTAACTTGAATTCTCTTTTCATAA
+SRR2467141.1 SALLY:472:C6NCDACXX:8:1101:1392:1873 length=101
HB?<?CFGGGC9@FF@GGGG>EEEDGDHGFHGE;AEFH>AC@D;@B>C>CCC@C>>DDCC3:>AA5>CC>>CCD>@CCDCDCCCCC@C@>C
@SRR2467141.2 SALLY:472:C6NCDACXX:8:1101:1326:1950 length=101
CTGGCCAAATGCCCCATTATTTTGAGTATTGTTATTTCCAAATAAACTGTTACTATTACTGCCAGCGGCAGAAGTGAATCCACAGATC
+SRR2467141.2 SALLY:472:C6NCDACXX:8:1101:1326:1950 length=101
FH?GEHEHIDEHCCFEHIE@GIGCHG<DGGIHIGIGGIGHIIFIHFFGDHGGIIHIGIIIIGGEH@EADB>=AA>3@>CCCCCCBC@C
@SRR2467141.3 SALLY:472:C6NCDACXX:8:1101:1477:1959 length=101
TTAGGCACATCATCCTGATAAGTGTACTTACCAGGATATATACCATCGGTATTGATGTTATCGGCATCACATAAAACTAATTCACCAGAAA
+SRR2467141.3 SALLY:472:C6NCDACXX:8:1101:1477:1959 length=101
HHHJJJIIJJIJJJJJJEIJJJIIJJJIJIJJIJJIIIJJJIIIHIIIJDHIJJIGJJIJJJJJIHHHFFFFFEEEEEEDDEDDDDDDDDD
We can see that even in the first lines cuts were made, mostly in the 3’ end.
Let’s now review the global results using FastQC:
$ fastqc -O ./QUAL/ Trimmed_Partial_SRR2467141.fastq
Started analysis of Trimmed_Partial_SRR2467141.fastq
Approx 20% complete for Trimmed_Partial_SRR2467141.fastq
Approx 40% complete for Trimmed_Partial_SRR2467141.fastq
Approx 60% complete for Trimmed_Partial_SRR2467141.fastq
Approx 80% complete for Trimmed_Partial_SRR2467141.fastq
Analysis complete for Trimmed_Partial_SRR2467141.fastq
Open the html result Trimmed_Partial_SRR2467141_fastqc.html.
What are the differences between raw and clean sequences?
Compare html files between raw and clean sequences. Summarizes: What differences are there? What do you think is the cause of these differences?
We save our data using:
docker commit a646764e06a0 lizfernandez/hic-langebio:day1