Map Task
Many bioinformatics pipelines run a processing step in parallel across samples and then aggregate the outputs for downstream analysis. A prominent example is bulk RNA-sequencing, where alignment is performed to produce transcript abundances per sample, and the gene counts of all samples are subsequently merged. A single count matrix is convenient for downstream steps such as differential gene expression analysis. Another example is running FastQC on multiple samples and summarizing the results in a MultiQC report.
The Latch SDK introduces a construct called map_task to help parallelize a task across a list of inputs. This means you can run multiple instances of the task at the same time inside a single workflow, providing valuable performance gains.
Let’s look at a simple example below!
First, import map_task into your workflow:
Next, define a task to use in the map task.
A map task can only accept one input and produce one output.
Let’s also define a task that collects the mapped output and returns a string:
We can run a_mappable_task across a collection of inputs using the map_task function. This function takes in a_mappable_task and returns a mapped version of that task. The mapped version takes as input a list of inputs to a_mappable_task, and returns a list of the outputs of a_mappable_task run on all inputs in the list in parallel.
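The behaviour described above can be sketched in plain Python without the Latch SDK. In this sketch, local_map_task is a hypothetical stand-in for map_task built on a thread pool, and the bodies of a_mappable_task and coalesce, as well as the workflow name my_map_workflow, are assumptions chosen for illustration. On Latch these would be decorated tasks called inside a workflow.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def local_map_task(task: Callable[[T], R]) -> Callable[[List[T]], List[R]]:
    """Hypothetical stand-in for map_task: wrap `task` so it runs over a list."""

    def mapped(inputs: List[T]) -> List[R]:
        # Run one instance of the task per input; results keep the input order.
        with ThreadPoolExecutor() as pool:
            return list(pool.map(task, inputs))

    return mapped


def a_mappable_task(a_integer: int) -> str:
    # Assumed body: exactly one input and one output.
    return str(a_integer + 2)


def coalesce(b: List[str]) -> str:
    # Collect the list of mapped outputs into a single string.
    return "".join(b)


def my_map_workflow(a: List[int]) -> str:
    mapped_out = local_map_task(a_mappable_task)(a)
    return coalesce(mapped_out)
```

The point of the sketch is the shape of the call: the wrapper takes a single-input task and returns a function over a list, so the aggregation step receives a list of outputs.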
That’s it! You’ve successfully defined a_mappable_task, which is passed to map_task() and run repeatedly on a list of inputs in parallel. You have also defined a coalesce task that collects the list of outputs from the mapped task and returns a string.
Map a Task with Multiple Inputs
You may want to map a task with multiple inputs.
For example, the task below takes in 2 inputs, a base and a DNA sequence, and returns the percentage of that base in the sequence:
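A minimal sketch of such a two-input task, written as a plain Python function rather than a decorated Latch task. The percentage scale (0 to 100) and the handling of an empty sequence are assumptions made for this example.

```python
def count_task(base: str, dna_sequence: str) -> float:
    # Percentage of `base` in `dna_sequence`; an empty sequence is treated as 0%.
    if not dna_sequence:
        return 0.0
    return 100.0 * dna_sequence.count(base) / len(dna_sequence)
```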
But we only want to map this task over the base input while dna_sequence stays the same. Since a map task accepts only one input, we can do this by creating a new task that prepares the map task’s inputs.
We start by putting the inputs in a dataclass decorated with dataclass_json.
Let’s also define our helper task to prepare the map task’s inputs.
We now refactor the original count_task. Instead of 2 inputs, count_task has a single input:
Let’s use the new mappable_task in our workflow:
Great! Now we can use count_wf to spin up four tasks in parallel. The map_task returns a list of four floats, each of which is the percentage of one base in the DNA sequence.
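The whole refactor can be sketched in plain Python as follows. MapInput and prepare_map_inputs are hypothetical names, and a thread pool stands in for map_task. On Latch the dataclass would additionally carry the dataclass_json decorator and the functions would be decorated tasks, which this sketch leaves out so that it runs with the standard library alone.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List


@dataclass
class MapInput:
    # Hypothetical container bundling both values into the single input a map task accepts.
    base: str
    dna_sequence: str


def prepare_map_inputs(bases: List[str], dna_sequence: str) -> List[MapInput]:
    # Helper task: pair every base with the same, unchanging DNA sequence.
    return [MapInput(base=b, dna_sequence=dna_sequence) for b in bases]


def count_task(item: MapInput) -> float:
    # Refactored task: a single input instead of two.
    if not item.dna_sequence:
        return 0.0
    return 100.0 * item.dna_sequence.count(item.base) / len(item.dna_sequence)


def count_wf(bases: List[str], dna_sequence: str) -> List[float]:
    map_inputs = prepare_map_inputs(bases, dna_sequence)
    # Stand-in for the mapped task: one task instance per base.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(count_task, map_inputs))
```

Calling count_wf with four bases yields four floats, one per base, in the same order as the input list.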
Bonus: Learning through a Biological Example
In the example below, we walk through how to use the map task construct to run FastQC on multiple samples and summarize their results in a MultiQC report.
First, we define a Dataclass that contains a sample name and its associated FastQ file:
Then, we create a task to run FastQC on a single sample and output the result under the FastQC Results folder on Latch.
Concept check: Note how this task will later be mapped across a list of samples. Therefore, the task is defined to accept one input and return one output.
Next, define a second task that runs MultiQC on a given directory of analysis logs and compiles an HTML report.
Concept check: Because the map task will return a list of LatchDirs, each of which contains an individual sample’s FastQC results, the multiqc_task needs to also accept a list of LatchDirs.
Finally, we can specify our workflow, which accepts a list of Samples and returns a single directory with the MultiQC report:
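The fan-out and fan-in shape of this workflow can be sketched in plain Python. This sketch does not invoke FastQC or MultiQC and does not use Latch types: local Path objects stand in for LatchDir, a thread pool stands in for map_task, the task bodies only write placeholder files, and the names fastqc_task, fastqc_wf, and the report_dir parameter are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class Sample:
    name: str
    fastq: Path


def fastqc_task(sample: Sample) -> Path:
    # Placeholder for running FastQC: one sample in, one results directory out.
    out_dir = sample.fastq.parent / "FastQC Results" / sample.name
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "placeholder.txt").write_text(f"results for {sample.name}\n")
    return out_dir


def multiqc_task(fastqc_dirs: List[Path], report_dir: Path) -> Path:
    # Placeholder for running MultiQC: many result directories in, one report directory out.
    report_dir.mkdir(parents=True, exist_ok=True)
    names = sorted(d.name for d in fastqc_dirs)
    (report_dir / "report.txt").write_text("\n".join(names) + "\n")
    return report_dir


def fastqc_wf(samples: List[Sample], report_dir: Path) -> Path:
    # Fan out: one fastqc_task per sample. Fan in: a single multiqc_task.
    with ThreadPoolExecutor() as pool:
        fastqc_dirs = list(pool.map(fastqc_task, samples))
    return multiqc_task(fastqc_dirs, report_dir)
```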