Many bioinformatics pipelines run a processing step in parallel across samples and then aggregate the outputs for downstream analysis. A prominent example is bulk RNA-sequencing, where alignment produces transcript abundances per sample and the gene counts of all samples are subsequently merged. A single count matrix is convenient for downstream steps such as differential gene expression analysis. Another example is running FastQC on multiple samples and summarizing the results in a MultiQC report.
Use map_task to parallelize a task across a list of inputs. This means you can run multiple instances of the task at the same time inside a single workflow, providing valuable performance gains.
Let’s look at a simple example below!
First, import map_task into your workflow:
You can run a_mappable_task across a collection of inputs using the map_task function. This function takes in a_mappable_task and returns a mapped version of that task. The mapped version takes a list of inputs to a_mappable_task and returns a list of the outputs of a_mappable_task, run on all inputs in the list in parallel.
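The "takes a task, returns a mapped version" shape described above can be sketched in plain Python. This is not the Latch implementation: map_task_sketch is a hypothetical stand-in built on the standard library's thread pool, and the body of a_mappable_task is assumed for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def map_task_sketch(task: Callable[[T], R]) -> Callable[[List[T]], List[R]]:
    """Hypothetical stand-in for map_task: wrap a one-input function
    so that the returned function accepts a list of inputs."""

    def mapped(inputs: List[T]) -> List[R]:
        # Run the task once per input on a thread pool.
        # pool.map returns results in the same order as the inputs.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(task, inputs))

    return mapped


def a_mappable_task(a: int) -> str:
    # Assumed body, for illustration only.
    return str(a + 2)


# The mapped version takes a list of ints and returns a list of strings.
mapped_outputs = map_task_sketch(a_mappable_task)([1, 2, 3])
```

With these assumed definitions, mapped_outputs holds one string per input, in input order.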
You have defined a_mappable_task, which is passed to map_task() and run repeatedly on a list of inputs in parallel. You have also defined a coalesce task that collects the list of outputs from the mapped task and returns a string.
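As a rough illustration of this fan-out and fan-in pattern, here is a plain-Python sketch. The task bodies and the name my_map_workflow are assumptions, and a thread pool stands in for the mapped task; in a real workflow the functions would be Latch tasks.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def a_mappable_task(a: int) -> str:
    # Assumed body: runs once per element of the input list.
    return str(a + 2)


def coalesce(b: List[str]) -> str:
    # Fan-in step: collapse the list of mapped outputs into one string.
    return "".join(b)


def my_map_workflow(a: List[int]) -> str:
    # Hypothetical workflow: fan out over the inputs, then coalesce.
    with ThreadPoolExecutor() as pool:
        mapped_out = list(pool.map(a_mappable_task, a))
    return coalesce(mapped_out)


result = my_map_workflow([1, 2, 3])
```

The point of the sketch is the data flow: a list goes in, the mapped step yields a list of the same length, and coalesce reduces that list to a single value.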
Suppose we want to vary the base input while the dna_sequence stays the same. Since a map task accepts only one input, we can do this by creating a new task that prepares the map task’s inputs. We start by putting the inputs in a dataclass that is also decorated with dataclass_json.
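A minimal sketch of that bundling step follows. The names MapInput and prepare_map_inputs are hypothetical, and the dataclass_json decorator (from the third-party dataclasses_json package) is left out so that the example needs only the standard library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MapInput:  # hypothetical name for the bundled inputs
    base: str
    dna_sequence: str


def prepare_map_inputs(bases: List[str], dna_sequence: str) -> List[MapInput]:
    # Pair every base with the same sequence, so that each list element
    # carries everything a single mapped invocation needs.
    return [MapInput(base=b, dna_sequence=dna_sequence) for b in bases]


prepared = prepare_map_inputs(["A", "C", "G", "T"], "ACGTAC")
```

Each element of prepared is a single object, which is what lets a one-input mapped task receive both values.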
Next, we define count_task. Instead of 2 inputs, count_task has a single input:
Then we use the mappable_task in our workflow:
We expect count_wf to spin up four tasks in parallel. The map_task returns a list of four floats, each of which is the percentage of a given base in the DNA sequence.
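To make the expected result concrete, here is a plain-Python sketch of the counting step. It rests on several assumptions: the four tasks correspond to the bases A, C, G and T, the percentage is on a 0 to 100 scale, MapInput is a hypothetical name, and a thread pool stands in for the mapped task.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List


@dataclass
class MapInput:  # hypothetical bundle of the two original inputs
    base: str
    dna_sequence: str


def count_task(m: MapInput) -> float:
    # Percentage (0 to 100) of positions in the sequence equal to m.base.
    if not m.dna_sequence:
        return 0.0
    return 100.0 * m.dna_sequence.count(m.base) / len(m.dna_sequence)


def count_wf(dna_sequence: str) -> List[float]:
    # One input object per base (assumed here to be A, C, G, T).
    inputs = [MapInput(base=b, dna_sequence=dna_sequence) for b in "ACGT"]
    # Four invocations run on the pool; results keep the A, C, G, T order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(count_task, inputs))


percentages = count_wf("AACGTTTT")
```

For the sequence "AACGTTTT" this yields [25.0, 12.5, 12.5, 50.0], one float per base.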
Because the FastQC step produces a list of LatchDirs, each of which contains an individual sample’s FastQC results, the multiqc_task also needs to accept a list of LatchDirs. The workflow takes a list of Samples and returns a single directory with the MultiQC report:
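The list-of-directories-in, single-directory-out shape of this step can be sketched with the standard library alone. This sketch does not run MultiQC and does not use LatchDir: pathlib.Path stands in for the directory type, and multiqc_task_sketch, summary.tsv and the sample names are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


def multiqc_task_sketch(fastqc_dirs: List[Path], out_dir: Path) -> Path:
    """Hypothetical fan-in step: build one summary directory from
    many per-sample result directories."""
    out_dir.mkdir(parents=True, exist_ok=True)
    lines = []
    for d in fastqc_dirs:
        # Record the sample directory name and the files it contains.
        files = sorted(p.name for p in d.iterdir() if p.is_file())
        lines.append(f"{d.name}\t{len(files)}\t{','.join(files)}")
    (out_dir / "summary.tsv").write_text("\n".join(lines) + "\n")
    return out_dir


# Demo with placeholder per-sample directories in a temporary location.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    sample_dirs = []
    for name in ["sample_a", "sample_b"]:
        d = root / name
        d.mkdir()
        (d / "fastqc_data.txt").write_text("placeholder\n")
        sample_dirs.append(d)
    report_dir = multiqc_task_sketch(sample_dirs, root / "report")
    summary = (report_dir / "summary.tsv").read_text()
```

In a real workflow the body would invoke MultiQC on the collected directories; only the input and output types carry over from this sketch.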