Quality Control Analysis of Sequencing Reads
The ATAC Seq pipeline relies upon a set of tools to ensure data quality and remove any artifacts resulting from poor-quality reads.- FastQC-FastQC performs read quality analysis by aggregating per base quality scores across reads and plots key metrics. This can be used to filter out reads and trim portions of the read of poor quality.
- Trim Galore! - Trimgalore is used in this analysis to remove adapter sequences from the sequencing reads.
Read Alignments
In this part of the workflow, we aim to align reads against the reference genome. The outputs from this step are key in identifying segments of the genome enriched in chromatin-accessible elements. The ATAC Seq pipelines offers 4 choices of short read aligners for aligning reads to the genome and the read alignment is one of the most compute-heavy portions of the workflow.1
BWA:
BWA uses Burrows Wheeler transform to align reads against the genome. It is efficient and has a low memory footprint compared to other methods and this is the default aligner used by the workflow.
2
BowTie2:
BowTie2 is one of the most commonly used short-read aligners, which also uses Burrows Wheeler transform as an indexing kernel.
3
STAR:
STAR is another method for performing read alignments and relies on a suffix array data structure to compute alignments of short reads against the genome.
4
Chromap:
Chromap- Chromap is a relatively new, fast method for processing high-throughput ChIP and ATAC seq data. In addition to aligning, chromap, removes duplicates and performs a Tn5 shift, analyses specific to ChIP seq/ATAC seq data.
Post-processing read alignments
Filtering Alignments
Upon computing the alignments, the workflow uses PICARD to merge samples from the same library. The merged alignments obtained from the previous step are filtered to retain only alignments that are free of PCR duplicates, and poor-quality alignments. To that end, the workflow relies upon SAMtools, PiCard, and bedtools. The workflow uses PICARD to mark duplicates that are often a result of PCR amplification biases and if not removed can artificically boost the signal confidence. The reads that are marked as duplicates are removed from further analysis. The number of duplicates also serves as a method to gauge the library complexity. Further, the workflow uses SAMtools to remove reads aligning to the mitochondrial DNA, blacklisted portions of the genome that tend to attract the TN5 transposase, read alignments that are marked as secondary alignments, unmapped reads, reads containing over 4 mismatches, reads alignments that are soft-clipped, read pairs aligning to different chromosomes, and reads that do not align with concordant orientation.Peak calling and peak annotation
The most important signal extracted from these filtered alignments is the ATAC Seq peak information. Since the library preparation step performs a targetted biased tagmentation of the open chromatin regions, the coverage of reads aligning to the genome will not be uniform. The coverage is expected to be spiky, with an increased percapita coverage to the chromatin-accessible parts of the genome. These segments are often referred to as “peaks” in literature and the rest of the workflow spins around computing, visualizing, and annotating the peaks. To that end, The workflow relies upon Picard and preseq to estimate the library complexity and study the effect of PCR duplication on the library size- BedTools to create bigWig files, that can be loaded with IGV to visualize coverage signals,
- Deeptools to generate gene-body meta-profile from the bigWig files and compute genome-wide enrichment of the ATAC seq reads. In case, the samples are annotated as cases and controls, the enrichment is calculated concerning controls.
- MACS2 is used to call peaks. MACS2 is a count-based peak calling software, that identifies sections of the genome that have a significant pile-up of reads compared to the background. MACS2 can either identify broad or narrow peaks, and by default, the workflow identifies broad peaks. Narrow peaks are regions where the concentration of the read coverage signal (e.g., binding sites of a transcription factor) is very sharp and localized. These peaks typically represent short regions of high signal intensity. On the other hand, broad peaks, represent regions where the signal is spread out over a broader genomic area. The enrichment is not as sharply defined as in narrow peaks and can cover a wider region of DNA. These are likely to correspond to histone modifications.
- HOMER is employed to annotate the peaks relative to gene features. Since the chromatin-accessible regions are often present closer to the gene body, annotating the peaks with the nearest gene is often useful for downstream tasks.
- The workflow also relies on BEDTools to merge peaks across all samples and create consensus peaks, which retain only the peaks that are found in all samples and remove peaks that are present in sample-specific. The set of consensus peaks establishes a basis for studying peaks that are used to perform downstream tasks such as differential expression analysis between cases and controls. To count reads in consensus peaks, the workflow uses the featureCounts package.
- The matrix of peak counts obtained from the previous step is then used to perform a “differential accessibility analysis”. Differential accessibility analysis is very similar in vein to differential abundance testing usually performed for gene counts. The same statistical machinery is extended to peaks in this context of ATAC Seq data. This is a powerful method to identify peaks that are differentially enriched between different samples and can also help detect potential batch effects, and ensure that biological replicates have similar peaks. The workflow uses DESeq2 to perform the differential accessibility analysis and projects the counts onto the top 2 PCs.
- ATAQV tool kit is used to make QC plots and renders the QC metrics as an HTML report.