Bulk RNA-sequencing (RNA-seq) is a powerful technique for studying gene expression. By sequencing and analyzing RNA molecules within a sample—typically a collection of cells or tissues—RNA-seq allows researchers to identify and quantify the entire transcriptome. This approach enables the comparison of gene expression levels across different samples or conditions, making it possible to pinpoint differentially expressed genes that may be linked to specific biological processes or diseases.

Beyond identifying individual genes, bulk RNA-seq offers a comprehensive view of gene expression, uncovering entire pathways or networks of genes involved in various biological functions. Additionally, it can reveal novel transcripts or splice variants that may have been previously unannotated, providing deeper insights into the complexity of the transcriptome.

On this page, we will discuss important technical concepts needed during bioinformatics processing of Bulk RNA-seq data. These concepts will be discussed in order of an actual end-to-end run.

Quantification

Data Trimming

RNA sequencing data often contains unwanted adapter sequences and low-quality regions that need to be removed before analysis. Tools like TrimGalore! or fastp are used to trim these regions, ensuring that the data is clean and ready for accurate downstream analysis.

Data Filtering

After trimming, depending on the quality of the data, it may be important to filter out any remaining contaminants and irrelevant sequences. BBSplit can be employed to remove potential contaminants from other genomes, while SortMeRNA can be used to specifically eliminate ribosomal RNA.

Alignment and Quantification

Traditional Alignment and Quantification

Traditional alignment methods map RNA sequences (reads) to a reference genome, providing a precise, base-by-base alignment of each read to a specific location in the genome. This process is computationally intensive but results in a highly accurate mapping, which is crucial for downstream analyses that require detailed positional information. Methods implemented in the nf-core/rnaseq pipeline on Latch are:

1

STAR + Salmon (default):

STAR (Spliced Transcripts Alignment to a Reference) is a highly efficient aligner that performs full, base-by-base alignment of RNA sequences to the reference genome. STAR is specifically optimized for RNA-seq data, handling spliced reads effectively by mapping them across exon-exon junctions. This accurate mapping is crucial for identifying splice variants and other complex RNA features.

After alignment, Salmon is used to quantify transcript levels. Salmon takes the aligned reads (in BAM format) and estimates transcript abundances by considering the positional information provided by STAR. This approach ensures that the quantification is based on reads that are accurately mapped to specific transcript regions, providing reliable gene expression estimates.

2

STAR + RSEM:

Similar to the STAR + Salmon method, STAR performs the initial alignment. However, RSEM (RNA-Seq by Expectation-Maximization) is then used for transcript quantification. RSEM employs probabilistic models to assign reads to transcripts, taking into account uncertainties that arise from overlapping transcripts. This method provides detailed quantification at both the gene and isoform levels, making it particularly valuable for analyzing complex transcriptomes with multiple isoforms.

3

HISAT2 (No quantification):

HISAT2 is another aligner optimized for speed and memory efficiency. It uses a hierarchical indexing strategy to quickly align reads to the reference genome, making it well-suited for large datasets or environments with limited computational resources. However, in this workflow, HISAT2 is used without subsequent quantification, typically serving as a fast, standalone alignment tool for applications where only the alignment is needed.

Pseudo-alignment and Quantification

Pseudo-alignment methods offer an efficient alternative to traditional alignment by mapping reads to a set of transcripts they likely originate from, without assigning them to exact genomic positions. While this approach is faster and less computationally intensive, it still maintains a high level of accuracy in quantifying gene and transcript expression.

In practice, pseudo-alignment provides results comparable to traditional methods for many RNA-seq analyses, particularly when the focus is on expression quantification rather than detailed genomic features. This makes pseudo-alignment well-suited for large-scale studies or when computational resources are limited.

1

Salmon (Pseudo-alignment):

Salmon in its pseudo-alignment mode directly quantifies gene expression without performing a full alignment. It matches RNA sequences to a reference transcriptome by identifying k-mers (short, fixed-length sequences) that are unique to specific transcripts. This method efficiently determines the potential origin of each read based on these k-mers, quickly estimating transcript abundances.

2

Kallisto (Pseudo-alignment):

Kallisto operates similarly to Salmon, using a k-mer-based approach to match reads to a transcriptome. It constructs an index of k-mers from the reference transcripts and uses this index to rapidly assign reads to potential transcripts. Kallisto’s algorithm is highly efficient, allowing for fast quantification of gene expression across large datasets.

Alignment Post-Processing

After alignment, the data undergoes additional processing to ensure accuracy and reliability. SAMtools is used to sort and index the aligned sequences, making them more manageable for downstream analysis. UMI-tools and Picard MarkDuplicates are then employed to identify and mark duplicate reads. These duplicates can arise from true biological duplication, especially in highly expressed genes, or from PCR amplification during library preparation. In RNA-seq data, it is generally not recommended to remove these duplicates unless unique molecular identifiers (UMIs) are used, as they can represent true biological signals. The pipeline, therefore, marks duplicates to gauge the overall level of duplication, but does not remove them by default unless UMIs are used or the --skip_markduplicates parameter is explicitly specified.

Quality Control (QC) Statistics

To ensure the integrity and accuracy of the data, multiple quality control tools are applied:

  • FastQC provides an initial quality check on raw reads.
  • RSeQC and Qualimap offer metrics on RNA-seq data quality and alignment accuracy.
  • dupRadar assesses the level of duplication in the data.
  • Preseq estimates library complexity, helping predict how much sequencing would be needed for deeper coverage.
  • featureCounts measures read counts relative to gene biotypes.
  • DESeq2 generates PCA plots and pairwise distance heatmaps to assess sample similarity and quality.

Finally, MultiQC consolidates all QC results into a single, comprehensive report, making it easier to review and interpret the quality metrics.

Transcript Assembly and Quantification

StringTie is used to assemble transcripts from aligned RNA sequences and quantify their expression levels. It utilizes a novel network flow algorithm to reconstruct full-length transcripts, including multiple splice variants for each gene locus. StringTie also offers an optional de novo assembly step to identify novel transcripts.

Coverage File Creation

To visualize how the RNA sequences cover the genome, BEDTools and bedGraphToBigWig are used to generate coverage files. These files can be viewed in genome browsers, allowing researchers to assess the distribution and depth of sequencing coverage across the genome.

Refer to https://nf-co.re/rnaseq/3.14.0/docs/output/ for more information.

Differential Gene Expression

Input Matrix

The input matrix used for differential gene expression analysis is the .merged.gene_counts_length_scaled.tsv file from the nf-core/rnaseq pipeline. This matrix contains gene counts that have been bias-corrected and scaled by transcript length, making it well-suited for analyses where variations in transcript length across samples need to be accounted for.

Model Fitting

For differential gene expression analysis, DESeq2 is employed. The process involves the following steps:

  • Model Building: DESeq2 constructs a statistical model using the observed count data from the input matrix. This model captures the relationship between gene expression levels and the experimental conditions being studied.
  • Parameter Estimation: DESeq2 uses a method called maximum likelihood estimation to find the parameter values that best fit the observed data. These parameters describe the underlying distribution of gene expression levels.
  • Shrinkage Estimation: After parameter estimation, DESeq2 applies a Bayesian shrinkage technique to adjust the estimates of gene expression. Genes with low information content (i.e., genes with low expression or high variability) have their estimates pulled towards the overall average, while genes with high information content are adjusted minimally. This shrinkage process improves the stability and reliability of the differential expression estimates, making them more robust for downstream analysis.
  • Significance Testing: Finally, DESeq2 performs statistical tests to determine which genes are differentially expressed between conditions, accounting for multiple testing to control the false discovery rate.

Visualization

Several types of plots are generated to visualize the results of the differential gene expression analysis:

  • Volcano Plots: Volcano plots display the relationship between fold-change (fold-change refers to the ratio of expression levels between two conditions; it indicates how much a gene’s expression has increased or decreased) and statistical significance for each gene, highlighting those that are both highly differentially expressed and statistically significant. These plots help in identifying key genes of interest.
  • MA Plots: MA plots show the relationship between the average expression (mean) and the log2 fold-change for each gene. This visualization helps to identify systematic biases in the data and to see how expression levels vary across the experiment.
  • Heatmaps: Heatmaps are used to visualize the expression patterns of the top differentially expressed genes across all samples. They provide a clear overview of how gene expression varies between conditions, often revealing clusters of co-regulated genes.
  • PCA Plots: Principal Component Analysis (PCA) plots are generated to assess the overall similarity and variability between samples based on their gene expression profiles. These plots help to identify potential batch effects or outliers and to see how well samples group according to the experimental conditions.

Pathway Enrichment

Input Matrix

The input for pathway enrichment analysis is derived from the contrast files output from differential gene expression (DGE) results. These files contain log2 fold changes, p-values, and adjusted p-values (FDR) for each gene between the two conditions being compared in the contrast file.

Methods

The analysis employs two main methods to identify enriched pathways:

  • Gene Set Enrichment Analysis (GSEA): This method ranks all genes based on their expression changes and tests whether predefined gene sets (e.g., pathways) are overrepresented at the extremes (top or bottom) of this ranked list. GSEA is particularly useful for detecting subtle but coordinated changes in gene expression across a pathway.
  • Over Representation Analysis (ORA):
    • Upregulated ORA: Focuses on the subset of genes that are significantly upregulated, testing whether these genes are overrepresented in known biological pathways or functions.
    • Downregulated ORA: Similarly, this analysis identifies pathways that are significantly enriched with downregulated genes.

Databases

The analysis utilizes several key databases for pathway and function enrichment:

  • KEGG (Kyoto Encyclopedia of Genes and Genomes): A comprehensive resource that links genomic information with higher-order functional information, providing detailed pathway maps that are widely used in biological research.
  • Gene Ontology (GO): A standardized system that organizes genes into hierarchical categories based on three main aspects: biological process, molecular function, and cellular component. It allows for the annotation and analysis of gene functions across species.
  • Molecular Signatures Database (MSigDB): A collection of annotated gene sets designed for use with GSEA. It helps to identify and characterize biological processes and pathways that are enriched in specific experimental conditions.
  • WikiPathways: An open-source platform where pathways are curated by the scientific community. It offers a wide range of pathways that can be analyzed for gene enrichment.
  • Disease Ontology: A structured classification that standardizes the representation of diseases. It is used to link gene expression changes to disease-related pathways and processes.

Visualization

The results of the pathway enrichment analysis are visualized using various plot types to facilitate interpretation:

  • Dot Plots: Show the significance and gene ratio of enriched pathways.
  • Cnet Plots (Category Net Plots): Visualize the relationship between genes and pathways, showing how different genes contribute to the enrichment of multiple pathways.
  • Heat Plots: Highlight the expression patterns of key genes across different pathways, providing insights into how gene expression changes are coordinated within these pathways.
  • Tree Plots: Display hierarchical clustering of enriched pathways, revealing the similarity and relationships between different pathways based on gene overlap.
  • Emap Plots (Enrichment Map Plots): Show the network of enriched pathways, allowing for the visualization of pathway interconnections based on shared genes.
  • Bar Plots: Represent the top enriched pathways with bar height corresponding to significance or gene ratio.
  • Ridge Plots: Display the distribution of gene expression changes across pathways, highlighting the enrichment patterns for GSEA results.


Bulk RNA-seq Quantification Walkthrough

For detailed intructions on using Bulk RNA-Seq, read our tutorial here.