Bulk RNA-sequencing (RNA-seq) is a powerful technique for studying gene expression. By sequencing and analyzing RNA molecules within a sample—typically a collection of cells or tissues—RNA-seq allows researchers to identify and quantify the entire transcriptome. This approach enables the comparison of gene expression levels across different samples or conditions, making it possible to pinpoint differentially expressed genes that may be linked to specific biological processes or diseases.
Beyond identifying individual genes, bulk RNA-seq offers a comprehensive view of gene expression, uncovering entire pathways or networks of genes involved in various biological functions. Additionally, it can reveal novel transcripts or splice variants that may have been previously unannotated, providing deeper insights into the complexity of the transcriptome.
On this page, we will discuss important technical concepts needed during bioinformatics processing of Bulk RNA-seq data. These concepts will be discussed in order of an actual end-to-end run.
RNA sequencing data often contains unwanted adapter sequences and low-quality regions that need to be removed before analysis. Tools like TrimGalore! or fastp are used to trim these regions, ensuring that the data is clean and ready for accurate downstream analysis.
After trimming, depending on the quality of the data, it may be important to filter out any remaining contaminants and irrelevant sequences. BBSplit can be employed to remove potential contaminants from other genomes, while SortMeRNA can be used to specifically eliminate ribosomal RNA.
Traditional alignment methods map RNA sequences (reads) to a reference genome, providing a precise, base-by-base alignment of each read to a specific location in the genome. This process is computationally intensive but results in a highly accurate mapping, which is crucial for downstream analyses that require detailed positional information. Methods implemented in the nf-core/rnaseq pipeline on Latch are:
STAR + Salmon (default):
STAR (Spliced Transcripts Alignment to a Reference) is a highly efficient aligner that performs full, base-by-base alignment of RNA sequences to the reference genome. STAR is specifically optimized for RNA-seq data, handling spliced reads effectively by mapping them across exon-exon junctions. This accurate mapping is crucial for identifying splice variants and other complex RNA features.
After alignment, Salmon is used to quantify transcript levels. Salmon takes the aligned reads (in BAM format) and estimates transcript abundances by considering the positional information provided by STAR. This approach ensures that the quantification is based on reads that are accurately mapped to specific transcript regions, providing reliable gene expression estimates.
STAR + RSEM:
Similar to the STAR + Salmon method, STAR performs the initial alignment. However, RSEM (RNA-Seq by Expectation-Maximization) is then used for transcript quantification. RSEM employs probabilistic models to assign reads to transcripts, taking into account uncertainties that arise from overlapping transcripts. This method provides detailed quantification at both the gene and isoform levels, making it particularly valuable for analyzing complex transcriptomes with multiple isoforms.
HISAT2 (No quantification):
HISAT2 is another aligner optimized for speed and memory efficiency. It uses a hierarchical indexing strategy to quickly align reads to the reference genome, making it well-suited for large datasets or environments with limited computational resources. However, in this workflow, HISAT2 is used without subsequent quantification, typically serving as a fast, standalone alignment tool for applications where only the alignment is needed.
Pseudo-alignment methods offer an efficient alternative to traditional alignment by mapping reads to a set of transcripts they likely originate from, without assigning them to exact genomic positions. While this approach is faster and less computationally intensive, it still maintains a high level of accuracy in quantifying gene and transcript expression.
In practice, pseudo-alignment provides results comparable to traditional methods for many RNA-seq analyses, particularly when the focus is on expression quantification rather than detailed genomic features. This makes pseudo-alignment well-suited for large-scale studies or when computational resources are limited.
Salmon (Pseudo-alignment):
Salmon in its pseudo-alignment mode directly quantifies gene expression without performing a full alignment. It matches RNA sequences to a reference transcriptome by identifying k-mers (short, fixed-length sequences) that are unique to specific transcripts. This method efficiently determines the potential origin of each read based on these k-mers, quickly estimating transcript abundances.
Kallisto (Pseudo-alignment):
Kallisto operates similarly to Salmon, using a k-mer-based approach to match reads to a transcriptome. It constructs an index of k-mers from the reference transcripts and uses this index to rapidly assign reads to potential transcripts. Kallisto’s algorithm is highly efficient, allowing for fast quantification of gene expression across large datasets.
After alignment, the data undergoes additional processing to ensure accuracy and reliability.
SAMtools is used to sort and index the aligned sequences, making them more manageable for downstream analysis.
UMI-tools and Picard MarkDuplicates are then employed to identify and mark duplicate reads. These duplicates can arise from true biological duplication, especially in highly expressed genes, or from PCR amplification during library preparation.
In RNA-seq data, it is generally not recommended to remove these duplicates unless unique molecular identifiers (UMIs) are used, as they can represent true biological signals. The pipeline, therefore, marks duplicates to gauge the overall level of duplication, but does not remove them by default unless UMIs are used or the --skip_markduplicates parameter
is explicitly specified.
To ensure the integrity and accuracy of the data, multiple quality control tools are applied:
Finally, MultiQC consolidates all QC results into a single, comprehensive report, making it easier to review and interpret the quality metrics.
StringTie is used to assemble transcripts from aligned RNA sequences and quantify their expression levels. It utilizes a novel network flow algorithm to reconstruct full-length transcripts, including multiple splice variants for each gene locus. StringTie also offers an optional de novo assembly step to identify novel transcripts.
To visualize how the RNA sequences cover the genome, BEDTools and bedGraphToBigWig are used to generate coverage files. These files can be viewed in genome browsers, allowing researchers to assess the distribution and depth of sequencing coverage across the genome.
Refer to https://nf-co.re/rnaseq/3.14.0/docs/output/ for more information.
The input matrix used for differential gene expression analysis is the .merged.gene_counts_length_scaled.tsv file from the nf-core/rnaseq pipeline. This matrix contains gene counts that have been bias-corrected and scaled by transcript length, making it well-suited for analyses where variations in transcript length across samples need to be accounted for.
For differential gene expression analysis, DESeq2 is employed. The process involves the following steps:
Several types of plots are generated to visualize the results of the differential gene expression analysis:
The input for pathway enrichment analysis is derived from the contrast files output from differential gene expression (DGE) results. These files contain log2 fold changes, p-values, and adjusted p-values (FDR) for each gene between the two conditions being compared in the contrast file.
The analysis employs two main methods to identify enriched pathways:
The analysis utilizes several key databases for pathway and function enrichment:
The results of the pathway enrichment analysis are visualized using various plot types to facilitate interpretation:
For detailed intructions on using Bulk RNA-Seq, read our tutorial here.
Bulk RNA-sequencing (RNA-seq) is a powerful technique for studying gene expression. By sequencing and analyzing RNA molecules within a sample—typically a collection of cells or tissues—RNA-seq allows researchers to identify and quantify the entire transcriptome. This approach enables the comparison of gene expression levels across different samples or conditions, making it possible to pinpoint differentially expressed genes that may be linked to specific biological processes or diseases.
Beyond identifying individual genes, bulk RNA-seq offers a comprehensive view of gene expression, uncovering entire pathways or networks of genes involved in various biological functions. Additionally, it can reveal novel transcripts or splice variants that may have been previously unannotated, providing deeper insights into the complexity of the transcriptome.
On this page, we will discuss important technical concepts needed during bioinformatics processing of Bulk RNA-seq data. These concepts will be discussed in order of an actual end-to-end run.
RNA sequencing data often contains unwanted adapter sequences and low-quality regions that need to be removed before analysis. Tools like TrimGalore! or fastp are used to trim these regions, ensuring that the data is clean and ready for accurate downstream analysis.
After trimming, depending on the quality of the data, it may be important to filter out any remaining contaminants and irrelevant sequences. BBSplit can be employed to remove potential contaminants from other genomes, while SortMeRNA can be used to specifically eliminate ribosomal RNA.
Traditional alignment methods map RNA sequences (reads) to a reference genome, providing a precise, base-by-base alignment of each read to a specific location in the genome. This process is computationally intensive but results in a highly accurate mapping, which is crucial for downstream analyses that require detailed positional information. Methods implemented in the nf-core/rnaseq pipeline on Latch are:
STAR + Salmon (default):
STAR (Spliced Transcripts Alignment to a Reference) is a highly efficient aligner that performs full, base-by-base alignment of RNA sequences to the reference genome. STAR is specifically optimized for RNA-seq data, handling spliced reads effectively by mapping them across exon-exon junctions. This accurate mapping is crucial for identifying splice variants and other complex RNA features.
After alignment, Salmon is used to quantify transcript levels. Salmon takes the aligned reads (in BAM format) and estimates transcript abundances by considering the positional information provided by STAR. This approach ensures that the quantification is based on reads that are accurately mapped to specific transcript regions, providing reliable gene expression estimates.
STAR + RSEM:
Similar to the STAR + Salmon method, STAR performs the initial alignment. However, RSEM (RNA-Seq by Expectation-Maximization) is then used for transcript quantification. RSEM employs probabilistic models to assign reads to transcripts, taking into account uncertainties that arise from overlapping transcripts. This method provides detailed quantification at both the gene and isoform levels, making it particularly valuable for analyzing complex transcriptomes with multiple isoforms.
HISAT2 (No quantification):
HISAT2 is another aligner optimized for speed and memory efficiency. It uses a hierarchical indexing strategy to quickly align reads to the reference genome, making it well-suited for large datasets or environments with limited computational resources. However, in this workflow, HISAT2 is used without subsequent quantification, typically serving as a fast, standalone alignment tool for applications where only the alignment is needed.
Pseudo-alignment methods offer an efficient alternative to traditional alignment by mapping reads to a set of transcripts they likely originate from, without assigning them to exact genomic positions. While this approach is faster and less computationally intensive, it still maintains a high level of accuracy in quantifying gene and transcript expression.
In practice, pseudo-alignment provides results comparable to traditional methods for many RNA-seq analyses, particularly when the focus is on expression quantification rather than detailed genomic features. This makes pseudo-alignment well-suited for large-scale studies or when computational resources are limited.
Salmon (Pseudo-alignment):
Salmon in its pseudo-alignment mode directly quantifies gene expression without performing a full alignment. It matches RNA sequences to a reference transcriptome by identifying k-mers (short, fixed-length sequences) that are unique to specific transcripts. This method efficiently determines the potential origin of each read based on these k-mers, quickly estimating transcript abundances.
Kallisto (Pseudo-alignment):
Kallisto operates similarly to Salmon, using a k-mer-based approach to match reads to a transcriptome. It constructs an index of k-mers from the reference transcripts and uses this index to rapidly assign reads to potential transcripts. Kallisto’s algorithm is highly efficient, allowing for fast quantification of gene expression across large datasets.
After alignment, the data undergoes additional processing to ensure accuracy and reliability.
SAMtools is used to sort and index the aligned sequences, making them more manageable for downstream analysis.
UMI-tools and Picard MarkDuplicates are then employed to identify and mark duplicate reads. These duplicates can arise from true biological duplication, especially in highly expressed genes, or from PCR amplification during library preparation.
In RNA-seq data, it is generally not recommended to remove these duplicates unless unique molecular identifiers (UMIs) are used, as they can represent true biological signals. The pipeline, therefore, marks duplicates to gauge the overall level of duplication, but does not remove them by default unless UMIs are used or the --skip_markduplicates parameter
is explicitly specified.
To ensure the integrity and accuracy of the data, multiple quality control tools are applied:
Finally, MultiQC consolidates all QC results into a single, comprehensive report, making it easier to review and interpret the quality metrics.
StringTie is used to assemble transcripts from aligned RNA sequences and quantify their expression levels. It utilizes a novel network flow algorithm to reconstruct full-length transcripts, including multiple splice variants for each gene locus. StringTie also offers an optional de novo assembly step to identify novel transcripts.
To visualize how the RNA sequences cover the genome, BEDTools and bedGraphToBigWig are used to generate coverage files. These files can be viewed in genome browsers, allowing researchers to assess the distribution and depth of sequencing coverage across the genome.
Refer to https://nf-co.re/rnaseq/3.14.0/docs/output/ for more information.
The input matrix used for differential gene expression analysis is the .merged.gene_counts_length_scaled.tsv file from the nf-core/rnaseq pipeline. This matrix contains gene counts that have been bias-corrected and scaled by transcript length, making it well-suited for analyses where variations in transcript length across samples need to be accounted for.
For differential gene expression analysis, DESeq2 is employed. The process involves the following steps:
Several types of plots are generated to visualize the results of the differential gene expression analysis:
The input for pathway enrichment analysis is derived from the contrast files output from differential gene expression (DGE) results. These files contain log2 fold changes, p-values, and adjusted p-values (FDR) for each gene between the two conditions being compared in the contrast file.
The analysis employs two main methods to identify enriched pathways:
The analysis utilizes several key databases for pathway and function enrichment:
The results of the pathway enrichment analysis are visualized using various plot types to facilitate interpretation:
For detailed intructions on using Bulk RNA-Seq, read our tutorial here.