The goal of this tutorial is to provide an example of the steps required to get an existing Snakemake workflow running on Latch.

To get started, first clone the starter repository:

$ git clone git@github.com:latchbio/snakemake-v2-tutorial.git

and ensure that you have latch installed:

$ pip install latch==2.55.0.a6

1. Updating the Pipeline to use Latch Storage

For now, we do not need to make any edits to the Snakefile to make it work with Latch Storage. However, we will revisit this later in the tutorial (see Appendix 1).

2. Adding Resources + Containers to each Rule

The first real step is to specify how large each job's machine needs to be. Since this is a relatively low-footprint pipeline, we can keep each machine small, with 1 core and 2 GiB of RAM.

Because every rule has the same resource requirements, we can use a profile to specify them all at once instead of having to update every rule individually.

Create a directory called profiles/default and in it touch a file called config.yaml:

$ mkdir -p profiles/default
$ touch profiles/default/config.yaml

Then, add the following YAML content to the config.yaml:

default-resources:
  cpu: 1
  mem_mib: 2048

This will set the resources for every rule. Note that you can override these defaults for any rule by setting that rule's resources directly, as shown below.
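
For example, a hypothetical rule that needs more compute than the defaults could look like this (the rule name, files, and command are illustrative, not part of the tutorial pipeline):

rule align_reads:
    input:
        "samples/{sample}.fastq"
    output:
        "results/{sample}.bam"
    resources:
        cpu=4,
        mem_mib=8192
    shell:
        # Placeholder command; a real rule would run an aligner here
        "echo aligning {input} > {output}"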

We will skip creating containers for each rule; since all rules use the same conda environment, we can simply install that environment in the Docker image we build during latch register and have each rule run in that container instead.
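
For reference, if your rules did need different environments, Snakemake lets you attach an image to an individual rule with the container directive (the rule and image below are illustrative):

rule summarize:
    input:
        "results/summary_input.txt"
    output:
        "results/summary.txt"
    container:
        "docker://ubuntu:22.04"
    shell:
        "sort {input} > {output}"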

3. Writing Metadata

Now, we need to write a metadata file that our workflow will use to generate its parameter interface.

First, make a directory called latch_metadata and in it touch a file called __init__.py:

$ mkdir latch_metadata
$ touch latch_metadata/__init__.py

In latch_metadata/__init__.py, create a SnakemakeV2Metadata object as below:

from latch.types.directory import LatchDir, LatchOutputDir
from latch.types.metadata.latch import LatchAuthor
from latch.types.metadata.snakemake import SnakemakeParameter
from latch.types.metadata.snakemake_v2 import SnakemakeV2Metadata

metadata = SnakemakeV2Metadata(
    display_name="Snakemake Tutorial Workflow",
    author=LatchAuthor(),
    parameters={},
)

This object doesn't have any parameter metadata yet, so we need to add some. Looking at config.yaml (not the file in profiles/default), we see that the pipeline expects three config parameters: samples_dir, genome_dir, and results_dir. The first two are inputs to the pipeline and the last is the location where outputs will be stored.
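
For context, a config.yaml matching this description might look something like the following (the actual values in the starter repository may differ):

samples_dir: data/samples
genome_dir: data/genome
results_dir: results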

We want all three of these to be exposed in the UI, so we will add them to the parameters dict in latch_metadata/__init__.py:

from latch.types.directory import LatchDir, LatchOutputDir
from latch.types.metadata.latch import LatchAuthor
from latch.types.metadata.snakemake import SnakemakeParameter
from latch.types.metadata.snakemake_v2 import SnakemakeV2Metadata

metadata = SnakemakeV2Metadata(
    display_name="Snakemake Tutorial Workflow",
    author=LatchAuthor(),
    parameters={
        "samples_dir": SnakemakeParameter(
            display_name="Sample Directory",
            type=LatchDir,
        ),
        "genome_dir": SnakemakeParameter(
            display_name="Genome Directory",
            type=LatchDir,
        ),
        "results_dir": SnakemakeParameter(
            display_name="Output Directory",
            type=LatchOutputDir,
        ),
    },
)

In each parameter, we specified (1) a human-readable name to display in the UI, and (2) the type of parameter to accept. Since the workflow expects all of these to be directories, they are all LatchDirs (we made results_dir a LatchOutputDir because it is an output directory).

For now, this is all we need and we can move on, but if you like, feel free to customize the metadata object further using the rest of the SnakemakeV2Metadata interface.
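
For instance, you can attach author information to the workflow. A minimal sketch, assuming LatchAuthor accepts name and email fields (the values below are placeholders):

from latch.types.metadata.latch import LatchAuthor
from latch.types.metadata.snakemake_v2 import SnakemakeV2Metadata

metadata = SnakemakeV2Metadata(
    display_name="Snakemake Tutorial Workflow",
    author=LatchAuthor(name="Jane Doe", email="jane@example.com"),
    parameters={},  # the parameters dict from the previous snippet goes here
)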

4. Generating the Entrypoint

Now we need to generate the entrypoint file containing the Latch workflow that wraps our Snakemake workflow. This takes a single command:

$ latch snakemake generate-entrypoint .

This should create a directory called wf containing a file called entrypoint.py.
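
The exact contents depend on your latch version, so we won't reproduce the full file here. As a rough sketch, the core of the generated file is a runtime task like the one below (based on the fragments used in Appendix 1; the generator also emits imports for helpers such as snakemake_runtime_task and get_config_val, omitted here):

from pathlib import Path

from latch.types.directory import LatchDir, LatchOutputDir


@snakemake_runtime_task(cpu=1, memory=2, storage_gib=50)
def snakemake_runtime(
    pvc_name: str,
    samples_dir: LatchDir,
    genome_dir: LatchDir,
    results_dir: LatchOutputDir,
):
    print(f"Using shared filesystem: {pvc_name}")

    shared = Path("/snakemake-workdir")
    snakefile = shared / "Snakefile"

    config = {
        "samples_dir": get_config_val(samples_dir),
        "genome_dir": get_config_val(genome_dir),
        "results_dir": get_config_val(results_dir),
    }

    # ... the generated file then invokes snakemake with this snakefile
    # and config ...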

5. Generating the Dockerfile

The last step before registering is to generate the Dockerfile that will define the environment the runtime executes in. In particular, we want that environment to contain the conda environment defined by environment.yaml.

Again, we can accomplish this with a simple command:

$ latch dockerfile --snakemake -c environment.yaml . -f

This will generate a file called Dockerfile which, among other setup, installs the conda environment from environment.yaml into the image.

6. Registering and Running your Pipeline

Finally, we get to upload our pipeline to Latch. Simply run

$ latch register -y .

To run on Latch, you will also need to upload the test data. This is straightforward using latch cp:

$ latch cp data latch:///snakemake-tutorial-data

This will upload the data to a folder called snakemake-tutorial-data in your account on Latch.

Finally, navigate to Workflows and click on “Snakemake Tutorial Workflow”, select the parameters from the data you just uploaded, and run the workflow!

Appendix 1. Getting Sample Names Dynamically

You may have noticed that in the Snakefile, the sample names are hardcoded. This is obviously not desirable - we should be able to infer the sample names from the contents of the samples directory.

To accomplish this, we will need to edit both the Snakefile and the entrypoint itself. Since we need to know the contents of the samples directory outside of a rule, we will need to stage it locally before the pipeline executes.

First, add the following import to the top of the wf/entrypoint.py file:

from latch.ldata.path import LPath

Next, edit the start of snakemake_runtime(...) so that it is the following:

@snakemake_runtime_task(cpu=1, memory=2, storage_gib=50)
def snakemake_runtime(
    pvc_name: str,
    samples_dir: LatchDir,
    genome_dir: LatchDir,
    results_dir: LatchOutputDir,
):
    print(f"Using shared filesystem: {pvc_name}")

    shared = Path("/snakemake-workdir")
    snakefile = shared / "Snakefile"

    # Stage samples_dir on the shared filesystem so its contents can be
    # listed outside of a rule
    local_samples_dir = LPath(samples_dir.remote_path).download(shared / "samples")

    config = {
        "samples_dir": get_config_val(local_samples_dir),
        "genome_dir": get_config_val(genome_dir),
        "results_dir": get_config_val(results_dir),
    }

    ...

Here we explicitly download samples_dir before calling Snakemake; this way, we know the contents of the directory without needing to be inside a rule.

Lastly, we will need to edit the Snakefile and remove the hardcoded samples:

# Replace SAMPLES = ["A", "B"] with the following:

from pathlib import Path

SAMPLES = []
for sample in Path(config["samples_dir"]).iterdir():
    SAMPLES.append(sample.stem)

Now just re-register and watch all three samples run through the pipeline.