ATAC (assay for transposase-accessible chromatin) Sequencing is a powerful sequencing tool that has been used to study the epigenetic signature of genomes. ATAC seq aims to capture the open chromatin parts of the genome and throws light on the effect of chromatin packaging on gene expression and gene regulation. ATAC sequencing uses a hyperactive mutant of the TN5 transposase, which inserts itself into the open chromatin regions of the genome. Further, this transposase cleaves the double-stranded DNA and attaches adaptors to the fragments. These fragments are then purified, PCR-amplified, and sequenced using NGS. The workflow to process the raw ATAC sequencing reads is presented in the NFCore ATACseq workflow.

The workflow primarily aims at aligning reads against the genomes and computes a pile-up of reads at every base along the genome. The heart of the workflow is to clean up the alignments and compute “peaks”- more than expected pileup of reads along specific parts of the genome, likely to correspond to open chromatin portions of the genome. The workflow uses a series of tools and provides the user with various choices for every step in the process.

In this article, we present the steps in this pipeline and the set of software tools used to manipulate ATAC seq data.

Quality Control Analysis of Sequencing Reads

The ATAC Seq pipeline relies upon a set of tools to ensure data quality and remove any artifacts resulting from poor-quality reads.

  1. FastQC-FastQC performs read quality analysis by aggregating per base quality scores across reads and plots key metrics. This can be used to filter out reads and trim portions of the read of poor quality.
  2. Trim Galore! - Trimgalore is used in this analysis to remove adapter sequences from the sequencing reads.

Read Alignments

In this part of the workflow, we aim to align reads against the reference genome. The outputs from this step are key in identifying segments of the genome enriched in chromatin-accessible elements. The ATAC Seq pipelines offers 4 choices of short read aligners for aligning reads to the genome and the read alignment is one of the most compute-heavy portions of the workflow.

1

BWA:

BWA uses Burrows Wheeler transform to align reads against the genome. It is efficient and has a low memory footprint compared to other methods and this is the default aligner used by the workflow.

2

BowTie2:

BowTie2 is one of the most commonly used short-read aligners, which also uses Burrows Wheeler transform as an indexing kernel.

3

STAR:

STAR is another method for performing read alignments and relies on a suffix array data structure to compute alignments of short reads against the genome.

4

Chromap:

Chromap- Chromap is a relatively new, fast method for processing high-throughput ChIP and ATAC seq data. In addition to aligning, chromap, removes duplicates and performs a Tn5 shift, analyses specific to ChIP seq/ATAC seq data.

Post-processing read alignments

Filtering Alignments

Upon computing the alignments, the workflow uses PICARD to merge samples from the same library. The merged alignments obtained from the previous step are filtered to retain only alignments that are free of PCR duplicates, and poor-quality alignments. To that end, the workflow relies upon SAMtools, PiCard, and bedtools. The workflow uses PICARD to mark duplicates that are often a result of PCR amplification biases and if not removed can artificically boost the signal confidence. The reads that are marked as duplicates are removed from further analysis. The number of duplicates also serves as a method to gauge the library complexity. Further, the workflow uses SAMtools to remove reads aligning to the mitochondrial DNA, blacklisted portions of the genome that tend to attract the TN5 transposase, read alignments that are marked as secondary alignments, unmapped reads, reads containing over 4 mismatches, reads alignments that are soft-clipped, read pairs aligning to different chromosomes, and reads that do not align with concordant orientation.

Peak calling and peak annotation

The most important signal extracted from these filtered alignments is the ATAC Seq peak information. Since the library preparation step performs a targetted biased tagmentation of the open chromatin regions, the coverage of reads aligning to the genome will not be uniform. The coverage is expected to be spiky, with an increased percapita coverage to the chromatin-accessible parts of the genome. These segments are often referred to as “peaks” in literature and the rest of the workflow spins around computing, visualizing, and annotating the peaks. To that end,

The workflow relies upon Picard and preseq to estimate the library complexity and study the effect of PCR duplication on the library size

  1. BedTools to create bigWig files, that can be loaded with IGV to visualize coverage signals,
  2. Deeptools to generate gene-body meta-profile from the bigWig files and compute genome-wide enrichment of the ATAC seq reads. In case, the samples are annotated as cases and controls, the enrichment is calculated concerning controls.
  3. MACS2 is used to call peaks. MACS2 is a count-based peak calling software, that identifies sections of the genome that have a significant pile-up of reads compared to the background. MACS2 can either identify broad or narrow peaks, and by default, the workflow identifies broad peaks. Narrow peaks are regions where the concentration of the read coverage signal (e.g., binding sites of a transcription factor) is very sharp and localized. These peaks typically represent short regions of high signal intensity. On the other hand, broad peaks, represent regions where the signal is spread out over a broader genomic area. The enrichment is not as sharply defined as in narrow peaks and can cover a wider region of DNA. These are likely to correspond to histone modifications.
  4. HOMER is employed to annotate the peaks relative to gene features. Since the chromatin-accessible regions are often present closer to the gene body, annotating the peaks with the nearest gene is often useful for downstream tasks.
  5. The workflow also relies on BEDTools to merge peaks across all samples and create consensus peaks, which retain only the peaks that are found in all samples and remove peaks that are present in sample-specific. The set of consensus peaks establishes a basis for studying peaks that are used to perform downstream tasks such as differential expression analysis between cases and controls. To count reads in consensus peaks, the workflow uses the featureCounts package.
  6. The matrix of peak counts obtained from the previous step is then used to perform a “differential accessibility analysis”. Differential accessibility analysis is very similar in vein to differential abundance testing usually performed for gene counts. The same statistical machinery is extended to peaks in this context of ATAC Seq data. This is a powerful method to identify peaks that are differentially enriched between different samples and can also help detect potential batch effects, and ensure that biological replicates have similar peaks. The workflow uses DESeq2 to perform the differential accessibility analysis and projects the counts onto the top 2 PCs.
  7. ATAQV tool kit is used to make QC plots and renders the QC metrics as an HTML report.

While the analysis thus far focused on individual samples, if there are multiple replicates available, the workflow combines alignments from all the replicates and reruns all of the analysis on the merged replicates.

The workflow in addition to running all of the aforementioned analysis, also creates an interactive dashboard on Latch Plots. The workflow creates a registry table with the “run name”, and pulls all the data that needs to be plotted, into the registry table. This can then be loaded into the layout of the plot, and visualize the results from ATAC Seq QC, Differential accessibility analysis and, peaks. We use the R package, ATACseqQC, to calculate these metrics of interest.