How to run MAGeCK Count on Latch

  1. Find MAGeCK Count in your Workspace
    1. Find MAGeCK Count in “All Workflows” and open the workflow
  2. Enter the parameters for MAGeCK Count
    1. First add your Sample Labels, the labels you add should correspond to the Sample FastQ files you will give MAGeCK. Ex. “L1”, “CTRL”
    2. Then add your Sample FastQ file, these should correspond order wise to the Sample Labels you gave in the previous step.
      1. If you have technical replicates for a sample, add them within the same box.
    3. Then select your List Sequence File.
      1. You can learn more about the format of this file below.
    4. Then fill out the Output Prefix and Output Location and click Launch Workflow.
  3. Within no time your results will show up in the Data tab!

FYI

  • If you want to run multiple executions of this workflow click the large plus button at the bottom of the parameters to add an additional execution of the count workflow.

  • We have hidden many of the optional parameters under Hidden Parameters, you can click that if you would like to fine tune your execution run or want to use any of the advanced parameters.

Required Parameters

Sample Labels

  • The labels of each sample, these labels will be used to specify whether the samples are treatment or control in later MAGeCK steps
  • This defaults to sample1, sample2, etc… but would recommend specifying them yourself in this step as they are needed in later subcommands.

Fastq reads for each sample

  • The reads for each sample and should correspond to each sample label, each sample can have Technical Replicates added as well.
  • Accepted Files
    • fastq
    • fastq.gz
    • SAM/BAM

FYIs for Fastqs

  • If the sample reads are paired ends, the 2nd Fastq can be added to 2nd FastQ for Paired End Reads in the hidden parameters accordion. The files given here must correspond orderwise to the files given in the first Fastq parameter.
  • If you have Biological Replicates treat them as separate samples in MAGeCK Count and then you will be able to specify them as such in the MAGeCK Test and MAGeCK MLE steps when doing analysis.

List Sequence File

  • A file containing the list of sgRNA names, their sequences and associated genes. When starting from FASTQ, FASTQ.GZ or BAM files, MAGeCK needs to know the sgRNA sequences and targeting genes.
  • Accepted Files
    • .tsv
    • .csv
    • .txt with tab or comma separated values
  • Example: - You can download (right click to download) an example txt tab separated library file here:

    library.txt

    87.7KB

    • There are three columns in the library file: the sgRNA ID, the sequence, and the gene it is targeting.
sgRNA IDSequenceGene
s_10007TGTTCACAGTATAGTTTGCCCCNA1
s_10008TTCTCCCTAATTGCTTGCTGCCNA1
s_10027ACATGTTGCTTCCCCTTGCACCNC

Output Prefix

  • The prefix appended to all of the outputted files

Output Location

  • The directory where the files produced by this subcommand will be placed. A path can either be selected or if a new path is typed in field Latch will automatically create the folders in the data viewer.

Hidden Parameters

FastQ Parameters

2nd FastQs for Paired End Reads

  • The 2nd fastqs for paired end reads for each sample. These should correspond to each sample label, and each sample can have technical replicates added as well.

Quality Control Parameters

Day Zero Label

  • Specifying this will turn on the negative selection quality control and specify the label as the control sample (usually day 0 or plasmid). For every other sample label, the negative selection quality control will compare it with day0 sample, and estimate the degree of negative selections in essential genes.

Length of 5’ End Read Trimming

  • Length of trimming the 5’ of the reads. Default 0

Disable Discarding of sgRNAs Containing ‘N’ in FastQ Reads

  • Enabling this will count sgRNAs with Ns. By default, sgRNAs containing Ns are discarded by MAGeCK
  • It is expected that the first few thousand reads in an Illumina sequence fastq file are of comparatively low quality and frequently contain “N”s. An “N” means that the Illumina software was not able to make a basecall for this base. The reads at the beginning and end of the sequence data files originate from the edges of the flowcells, where imaging is more difficult, thus these reads show below average quality which is why MAGeCK by default discards them.

Reverse Complement the Sequences in Library for Read Mapping

  • Enabling this has MAGeCK reverse complement the sequences in the library for read mapping. By default read mapping will be performed with the sequences as they are in the library.

Method for Normalization

  • By default MAGeCK will use Median normalization.
  • Options:
    • None: no normalization
    • Median: median normalization, default
    • Total: normalization by total read counts
    • Control: normalization by control sgRNAs specified by the Control sgRNAs option. The median factor used for normalization will be calculated based on control sgRNAs only, rather than all the sgRNAs

Control sgRNAs

  • A list of control sgRNAs for normalization and for generating the null distribution of RRA. Alternatively Control Genes can be specified instead of this parameter. This option tells MAGeCK to use provided negative control sgRNAs to generate the null distribution when calculating the p values. By providing the corresponding sgRNA IDs in this parameter, MAGeCK will have a better estimation of p values.
  • When using this option, you will need to provide a plain text file just containing negative control sgRNA IDS (one per each line). For example,
NonTargetingControlGuideForHuman_0001
NonTargetingControlGuideForHuman_0002
NonTargetingControlGuideForHuman_0003
NonTargetingControlGuideForHuman_0004

Control Genes

  • A list of genes whose sgRNAs are used as control sgRNAs for normalization and for generating the null distribution of RRA. Alternatively Control sgRNA can be specified instead of this parameter. There are several issues that you need to keep in mind:
    • You should have enough number of negative control guides (>100 recommended) for accurate p value estimation and normalization.
    • It is known that for growth based screens, non-targeting controls may lead to high false positives (e.g., Morgens et al. 2017. Use non-targeting controls carefully.
  • By default MAGeCK will generate the null distribution of RRA scores by assuming all of the genes in the library are non-essential. This approach is sometimes over-conservative, and you can improve this if you know some genes are not essential.

Use Custom Pathway File For Quality Control (GMT Format)

  • The pathway file used for QC, in GMT format. By default it will use the GMT file provided by MAGeCK (mageckQC.gmt.
  • More information about GMT format can be found here and a repository for pathway files can be found here.

Output Settings

sgRNA Length

  • The length of the sgRNA. The program will automatically determine the sgRNA length from library file, so this parameter should likely be toggled off. If toggled on and given an sgRNA length, will put umapped reads to a file for viewing.

Keep Intermediate Files

  • Keeps intermediate files for this subcommand which are the .r, .rmb,.rnw files which can brought into an R software environment to plot the results of the execution

Run Settings

Test Run Using First 1M Records For Each File

Outputs

count.txt

  • A tab-separated count table, each line in the table should will the sgRNA name (1st column), the targeting gene (2nd column) and the read counts in each sample. Each item will be separated by the tab (‘\t’).
  • For example in the studies of T. Wang et al. Science 2014, there are 4 CRISPR screening samples, and they are labeled as: HL60.initial, KBM7.initial, HL60.final, KBM7.final. Here are a few example lines of the count file:
sgRNAGeneHL60.initialKBM7.initialHL60.finalKBM7.final
A1CF_m52595977A1CF213274883175
A1CF_m52596017A1CF29441215541891
A1CF_m52596056A1CF421368566759
A1CF_m52603842A1CF274243314855
A1CF_m52603847A1CF050145266

count_normalized.txt

  • A normalized count file. Please forgive me as I’m not really sure what the significance of this file is, but will update this once I figure it out. Or if you can help explain this to me please contact me at nathan@latch.bio.

countsummary.txt

  • This file is generated by count command, and summarizes QC measurements of the fastq (or count table) files. Learn more about it from the MAGeCK Wiki.

countssummary.R

  • This file contains code that can be executed within the R software environment to plot the data from the count subcommand and create a PDF from it. This file can be used in a program such as RStudio.

countsummary.Rnw

  • This file is called by the counts summary.R file and has the specific code for plotting the results.

Log File

  • This file contains all of the logs of the execution. This file is mostly a bunch of techno gobbledygook but you can view it to view any errors the execution might have encountered.

What is MAGeCK

Model-based Analysis of Genome-wide CRISPR-Cas9 Knockout (MAGeCK is a computational tool to identify important genes from the recent genome-scale CRISPR-Cas9 knockout screens (or GeCKO) technology. MAGeCK can be used for prioritizing single-guide RNAs, genes and pathways in genome-scale CRISPR/Cas9 knockout screens. MAGeCK identifies both positively and negatively selected genes simultaneously and reports robust results across different experimental conditions. MAGeCK is developed and maintained by Wei Li and Han Xu from Prof. Xiaole Shirley Liu’s lab at the Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute and Harvard School of Public Health. MAGeCK has been used to identify functional lncRNAs from screens with close to 100% validation rate.