HiC alignment strategies
Learning objectives
- Setting up genome files
- Run HiCUP steps and inspect the output
- Interpret the HiCUP report
Setting up the genome files
In order to run HiCUP to map and filter HiC reads, two files must be generated: the bowtie2 index, and the digested genome file.
First we generate a bowtie2 index for the genome.
bowtie2-build --threads 1 maize_mini2.fa maize_mini2
Then we use HiCUP digester to generate an in silico digested genome file. - –re1 is the restriction enzyme used in the procol. The restriction site, as well as the cut site (with ^) must be indicated. - –genome is the name of the genome for the output file (optional)
hicup_digester --re1 ^GATC,DpnII --genome maize_mini2 maize_mini2.fa
Let’s inspect the Digested file.
head Digest_maize_mini2_DpnII_None_17-05-28_30-10-2019.txt
HiCUP truncater
The next step is to truncate the sequence downstream of a ligation site of the read.
hicup_truncater --re1 ^GATC,DpnII ZmEn_HiC_sub_1.fq.gz ZmEn_HiC_sub_2.fq.gz
Let’s inspect the truncation results.
less hicup_truncater_summary_wJYYjDJNrR_18-03-50_26-10-2019.txt
We expect a higher percentage of truncated reads with longer reads (~150 nts). Another factor is the distribution of restriction fragment lengths.
After truncation, the average read length is ~80 which is still a reasonable read length.
HiCUP mapper
The next step is to map the read pairs to the reference genome.
Forward and reverse reads are mapped independently, and then the resulting alignments are paired again to produce a paired end bam.
hicup_mapper --threads 1 --bowtie2 /data/software/bowtie2-2.3.5.1-linux-x86_64/bowtie2 --index maize_mini2 ZmEn_HiC_sub_1.trunc.fastq ZmEn_HiC_sub_2.trunc.fastq
We can inspect how many read pairs were correctly mapped in the hicup_mapper_summary file.
head hicup_mapper_summary_cKQMtNGDIb_18-27-31_26-10-2019.txt
HiCUP filter
After mapping, the resulting SAM file is parsed to filter out uninformative read pairs.
hicup_filter --digest Digest_maize_mini2_DpnII_None_02-07-29_01-11-2019.txt ZmEn_HiC_sub_1_2.pair.sam --longest 800 --shortest 150
HiCUP deduplicater
The final step of the workflow is to remove read pair duplicates. Duplicates may arise during the PCR protocol or in the sequencing step (optical duplicates). Removing duplicates is done by comparing the start and end coordinates of both reads of a read pair.
hicup_deduplicator --zip ZmEn_HiC_sub_1_2.filt.sam
Run whole HiCUP pipeline
An useful feature of HiCUP is that it can be run as a complete pipeline, producing a final html report.
To do this we setup a configuration file:
#Example configuration file for the hicup Perl script - edit as required
########################################################################
#Directory to which output files should be written (optional parameter)
#Set to current working directory by default
Outdir:
#Number of threads to use
Threads: 1
#Suppress progress updates (0: off, 1: on)
Quiet:0
#Retain intermediate pipeline files (0: off, 1: on)
Keep:0
#Compress outputfiles (0: off, 1: on)
Zip:1
#Path to the alignment program (Bowtie or Bowtie2)
#Remember to include the executable Bowtie/Bowtie2 filename.
#Note: ensure you specify the correct aligner i.e. Bowtie when
#using Bowtie indices, or Bowtie2 when using Bowtie2 indices.
#In the example below Bowtie2 is specified.
Bowtie2: /usr/local/src/bowtie2-2.3.5.1-linux-x86_64/bowtie2
#Path to the reference genome indices
#Remember to include the basename of the genome indices
Index: maize_mini2
#Path to the genome digest file produced by hicup_digester
Digest: Digest_maize_mini2_DpnII_None_02-07-29_01-11-2019.txt
#FASTQ format (valid formats: 'Sanger', 'Solexa_Illumina_1.0', 'Illumina_1.3' or 'Illumina_1.5')
#If not specified, HiCUP will try to determine the format automatically by analysing
#one of the FASTQ files. All input FASTQ will assumed to be in this format
Format:
#Maximum di-tag length (optional parameter)
Longest: 800
#Minimum di-tag length (optional parameter)
Shortest: 100
#FASTQ files to be analysed, placing paired files on adjacent lines
ZmEn_HiC_sub_1.fq.gz
ZmEn_HiC_sub_2.fq.gz
hicup --config hicup_config.txt
Finally let’s inspect the html output.
open ZmMC_HiC_2.1.10_sub_1_2.HiCUP_summary_report.html
For further information about HiCUP about you can access the following resources:
HiCUP documentation https://www.bioinformatics.babraham.ac.uk/projects/hicup/read_the_docs/html/index.html
Video explainer of HiCUP reports https://www.youtube.com/watch?v=xWpjlXnsOU4
HiCUP publication https://www.ncbi.nlm.nih.gov/pubmed/26835000