When Snakemake workflows are executed locally on a single computer or high-performance cluster, all dependencies and input/output files are on a single machine.

When a Snakemake workflow is executed on Latch, each generated job is run in a separate container on a potentially isolated machine.

Therefore, it may be necessary to adapt your Snakefile to address issues arising from this execution method that do not come up during local execution:

1. Rule inputs must be explicitly specified, or the files will not be downloaded into the rule's task.
2. Snakefile code that is not under any rule runs in every task and cannot assume input files exist on disk.
3. Rules that share large inputs download the same files repeatedly, increasing runtime and cost.

Here, we will walk through examples of each of the cases outlined above.
When a Snakemake workflow is executed on Latch, each generated job for a Snakefile rule is run on a separate machine. Only files and directories explicitly specified under the `input` directive of the rule are downloaded into the task.

A typical example: if the index files accompanying biological data are not explicitly specified as an input, the generated job for that rule will fail because the index files are missing.
In the example below, there are two Snakefile rules:

- `delly_s`: runs Delly to call SVs and outputs an unfiltered BCF file, followed by quality filtering using `bcftools filter` to retain only the SV calls that pass certain filters. Finally, it indexes the BCF file.
- `delly_merge`: merges or concatenates the BCF files containing SV calls from the `delly_s` rule, producing a single VCF file. This rule requires an index file to be available for each corresponding BCF file.
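A minimal sketch of these rules (sample names, paths, and exact tool flags are illustrative, not the original workflow's code):

```python
SAMPLES = ["sampleA", "sampleB"]  # hypothetical sample names

rule delly_s:
    input:
        bam="mapped/{sample}.bam",
        ref="ref/genome.fa",
    output:
        # note: the .bcf.csi index written by `bcftools index` below
        # is NOT declared as an output
        bcf="delly/{sample}.bcf",
    shell:
        "delly call -g {input.ref} -o {wildcards.sample}.unfiltered.bcf {input.bam} && "
        "bcftools filter -O b -o {output.bcf} -i 'FILTER==\"PASS\"' {wildcards.sample}.unfiltered.bcf && "
        "bcftools index {output.bcf}"

rule delly_merge:
    input:
        # bcftools concat -a needs a .csi index next to each BCF,
        # but the indexes are not declared here
        bcfs=expand("delly/{sample}.bcf", sample=SAMPLES),
    output:
        "delly/merged.vcf",
    shell:
        "bcftools concat -a -O v -o {output} {input.bcfs}"
```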
On Latch, the task fails because the BCF index files (ending with `.bcf.csi`) are produced by the `delly_s` rule but are not explicitly specified as inputs to the `delly_merge` rule. Hence, the index files are not downloaded into the task that executes the `delly_merge` rule.
To resolve the error, we need to add the index files as an output of the `delly_s` rule and an input of the `delly_merge` rule.
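Applied to the sketch above (still illustrative), both rules now declare the `.bcf.csi` files:

```python
rule delly_s:
    input:
        bam="mapped/{sample}.bam",
        ref="ref/genome.fa",
    output:
        bcf="delly/{sample}.bcf",
        # the index is now an explicit output, so it is uploaded
        # when the task finishes
        csi="delly/{sample}.bcf.csi",
    shell:
        "delly call -g {input.ref} -o {wildcards.sample}.unfiltered.bcf {input.bam} && "
        "bcftools filter -O b -o {output.bcf} -i 'FILTER==\"PASS\"' {wildcards.sample}.unfiltered.bcf && "
        "bcftools index {output.bcf}"

rule delly_merge:
    input:
        bcfs=expand("delly/{sample}.bcf", sample=SAMPLES),
        # the indexes are now explicit inputs, so they are downloaded
        # into the task before bcftools runs
        csis=expand("delly/{sample}.bcf.csi", sample=SAMPLES),
    output:
        "delly/merged.vcf",
    shell:
        "bcftools concat -a -O v -o {output} {input.bcfs}"
```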
Tasks at runtime only download the files their target rules explicitly depend on. Shared code, i.e. Snakefile code that is not under any rule, runs in every task and will usually fail if it tries to read input files.
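For example, consider a Snakefile that discovers samples by globbing an input directory at the top level (the directory layout and the `fastqc` rule body are illustrative):

```python
from pathlib import Path

input_dir = Path("inputs")

# Top-level code: this runs in EVERY task, not just during the JIT
# step. In the task for `fastqc`, the "inputs" directory was never
# downloaded, so sample discovery breaks there.
samples = [p.stem for p in input_dir.glob("*.fastq")]

rule all:
    input:
        expand("outputs/{sample}_fastqc.html", sample=samples),

rule fastqc:
    input:
        "inputs/{sample}.fastq",
    output:
        "outputs/{sample}_fastqc.html",
    shell:
        "fastqc {input} -o outputs"
```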
Since the `Path("inputs").glob(...)` call is not under any rule, it runs in all tasks. Because the `fastqc` rule does not specify `input_dir` as an `input`, the directory will not be downloaded and the code will throw an error.
Only access files when necessary (i.e. when computing dependencies, as in the example, or in a rule body) by placing the problematic code within rule definitions. Either inline the variable directly or write a function to use in place of the variable, as in the sketch below.
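A sketch of the fix, moving the glob into a function called from a rule declaration (names as before, illustrative):

```python
from pathlib import Path

def get_samples():
    # Only called while Snakemake computes rule inputs, i.e. during
    # the JIT step, where the "inputs" directory actually exists.
    return [p.stem for p in Path("inputs").glob("*.fastq")]

rule all:
    input:
        lambda wildcards: expand(
            "outputs/{sample}_fastqc.html", sample=get_samples()
        ),

rule fastqc:
    input:
        "inputs/{sample}.fastq",
    output:
        "outputs/{sample}_fastqc.html",
    shell:
        "fastqc {input} -o outputs"
```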
This works because the JIT step replaces `input`, `output`, `params`, and other declarations with static strings for the runtime workflow, so any function calls within them will be replaced with pre-computed strings and the Snakefile will not attempt to read the files again.
Same example at runtime:
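Conceptually, the JIT-generated runtime Snakefile contains only the pre-computed strings (sample names hypothetical):

```python
rule all:
    input:
        "outputs/sampleA_fastqc.html",
        "outputs/sampleB_fastqc.html",

rule fastqc:
    input:
        "inputs/{sample}.fastq",
    output:
        "outputs/{sample}_fastqc.html",
    shell:
        "fastqc {input} -o outputs"
```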
Example using multiple return values:
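If several declarations need related values, one helper can compute them together. A sketch (not the original example) where a single function returns both the FASTQ paths and the sample names:

```python
from pathlib import Path

def discover_inputs():
    # Returns (fastq paths, sample names) in one pass; only called
    # inside rule declarations, so it never runs file I/O at runtime.
    fastqs = sorted(Path("inputs").glob("*.fastq"))
    return fastqs, [p.stem for p in fastqs]

rule all:
    input:
        lambda wildcards: expand(
            "outputs/{sample}_fastqc.html",
            sample=discover_inputs()[1],
        ),
```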
When a Snakemake workflow runs on Latch, each rule is executed on a separate, isolated machine. As a result, all input files specified for a rule are downloaded to the machine every time the rule is run. Downloading the same input files repeatedly across multiple rules can increase workflow runtime and cost, especially when the data files are large.

To optimize performance and minimize costs, it is recommended to consolidate logic that relies on shared inputs into a single rule.
Instead of having separate rules that each re-download the BAM file to mark duplicates, call variants, and filter variants, we consolidate the logic into a single rule, reducing redundant data downloads.
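A minimal sketch of such a consolidated rule (tool choice and flags are illustrative; a real pipeline would also need reference indexes):

```python
rule process_bam:
    input:
        bam="mapped/{sample}.bam",
        ref="ref/genome.fa",
    output:
        vcf="calls/{sample}.filtered.vcf.gz",
    # One task marks duplicates, calls variants, and filters, so the
    # large BAM and reference are downloaded a single time. (Assumes
    # the BAM carries mate score tags from `samtools fixmate`.)
    shell:
        "samtools markdup {input.bam} dedup.bam && "
        "samtools index dedup.bam && "
        "bcftools mpileup -f {input.ref} dedup.bam | bcftools call -mv -O z -o raw.vcf.gz && "
        "bcftools filter -i 'QUAL>20' -O z -o {output.vcf} raw.vcf.gz"
```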