Snakemake exposes a storage interface that lets developers write plugins for reading from and writing to custom data stores. The snakemake-storage-plugin-latch package is one such plugin: it allows Snakemake to interact natively with Latch Data.

Overview

The storage plugin works by treating files on Latch as if they were files under a nonexistent /ldata directory. For example, the file latch://123.account/a/b/c.txt would be represented internally as /ldata/123.account/a/b/c.txt.

This scheme allows common patterns such as Path(dir) / "file" to “just work” with Latch objects. When creating the config file, LatchFiles and LatchDirs are encoded as paths of this form.
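The mapping described above can be sketched in a few lines of Python. This is only an illustration of the path scheme, not the plugin's actual implementation; the helper name latch_url_to_ldata_path and the LDATA_ROOT constant are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical constant: the virtual root under which Latch objects appear.
LDATA_ROOT = PurePosixPath("/ldata")


def latch_url_to_ldata_path(url: str) -> PurePosixPath:
    """Map a latch:// URL onto the corresponding path under /ldata (illustrative only)."""
    parsed = urlparse(url)
    if parsed.scheme != "latch":
        raise ValueError(f"not a latch:// URL: {url!r}")
    # netloc is the account domain (e.g. "123.account"); path is the object path.
    return LDATA_ROOT / parsed.netloc / parsed.path.lstrip("/")


path = latch_url_to_ldata_path("latch://123.account/a/b/c.txt")
print(path)  # /ldata/123.account/a/b/c.txt

# Ordinary path arithmetic keeps working on the result.
print(path.parent / "d.txt")  # /ldata/123.account/a/b/d.txt
```

Because the result is an ordinary path object, joins such as Path(dir) / "file" need no Latch-specific handling.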

Usage

Configuring your Snakefile to use this storage plugin is, in most cases, fairly straightforward and requires only minimal edits. The sections below describe the common cases where edits are required.

Using the {input} / {output} Wildcards

First, ensure that no shell command contains a hardcoded path. For example, the rule

rule test_storage:
    input:
        "hello.txt"
    output:
        os.path.join(config['remote_output_dir'], "hello.txt") # assume config['remote_output_dir'] is a path on Latch
    shell:
        "cp {input} {output}"

will copy the local file hello.txt onto Latch under remote_output_dir.

Note that in the example above, the shell command never explicitly references the output path, and instead references the {output} wildcard. This is intentional, and all rules that can reference Latch objects must use this pattern to function correctly.

Snakemake storage plugins generally work by performing all operations on a local copy of the remote file, then uploading that local copy back to remote storage at the end of rule execution. In the example above, the {output} wildcard is replaced with the path of the local copy. This local copy is stored opaquely, and its location can change at runtime depending on how the pipeline is configured, so the only reliable way to reference it is through the wildcard. The same applies to inputs and the {input} wildcard, for exactly the same reason.
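The local-copy behavior can be modeled with a toy Python sketch. It is not Snakemake's or the plugin's code: the function run_with_local_copy is hypothetical, a temporary directory stands in for the opaque local cache, and a plain file copy stands in for the upload.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Callable


def run_with_local_copy(remote_output: Path, job: Callable[[Path], None]) -> None:
    """Run `job` against a temporary local path, then copy the result to `remote_output`."""
    with tempfile.TemporaryDirectory() as cache_dir:
        # The location of the local copy is an implementation detail. The job
        # receives it as an argument, much like a shell command receives {output}.
        local_copy = Path(cache_dir) / remote_output.name
        job(local_copy)
        # Stand-in for the upload step that happens once the rule has finished.
        remote_output.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(local_copy, remote_output)


def write_greeting(output: Path) -> None:
    """Example job: it only ever writes to the path it was handed."""
    output.write_text("hello\n")


with tempfile.TemporaryDirectory() as fake_remote:
    target = Path(fake_remote) / "results" / "hello.txt"
    run_with_local_copy(target, write_greeting)
    print(target.read_text(), end="")  # hello
```

A job that wrote directly to the target path instead of the path it was handed would bypass the staging step, which is the situation a hardcoded path in a shell command creates.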

Remote Paths in the params: Directive

When using a remote path in the params: directive, the path must be marked with the storage(...) flag.

By default, Snakemake does not treat params: members as storage objects unless explicitly told to do so, so file downloads and uploads will not happen for them. For this reason, every params: value that can be a remote storage object must be marked with storage(...). For example:

rule test_storage:
    input:
        "hello.txt"
    params:
        auxilliary = storage(config['auxilliary_file'])
    output:
        os.path.join(config['remote_output_dir'], "hello.txt") # assume config['remote_output_dir'] is a path on Latch
    shell:
        "cp {input} {output} && cp {input} {params.auxilliary}"

Using Filesystem APIs outside of Rules

Because remote paths are remote identifiers rather than local filesystem objects, code that uses filesystem APIs on them (for instance, to read the contents of a file) will not work outside of a rule. When pipeline execution depends on certain files being present outside of rules, we recommend explicitly downloading those files in the runtime task before calling snakemake.

Suppose, for example, that you expand a wildcard based on which files are present in a specific input_dir. You can stage this input_dir ahead of time as follows:

# Assume `input_dir` is a `LatchDir` parameter passed to `snakemake_runtime(...)`

print(f"Staging {input_dir.remote_path}...", flush=True)

# Need `from latch.ldata.path import LPath` at the top of file
input_dir = LPath(input_dir.remote_path).download(shared / "input_dir")

print("Done.")

config = {
    ...
    "input_dir": get_config_val(input_dir),
    ...
}

Note that after the directory is downloaded locally, instead of passing the remote identifier to the config object, we pass the local path of the downloaded directory.
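Once the directory is staged, code outside of rules can list it with ordinary filesystem APIs. The following is a minimal sketch of that pattern; the helper discover_samples, the .txt suffix, and the sample file names are hypothetical, and a temporary directory stands in for the staged input_dir.

```python
import tempfile
from pathlib import Path


def discover_samples(input_dir: str, suffix: str = ".txt") -> list[str]:
    """Return the sorted names (minus `suffix`) of the files in a local directory."""
    return sorted(
        p.name[: -len(suffix)]
        for p in Path(input_dir).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )


# Stand-in for the staged directory; in a Snakefile this would be the local
# path passed in through the config (assumed here to be config["input_dir"]).
with tempfile.TemporaryDirectory() as staged:
    for name in ("sample_b.txt", "sample_a.txt", "notes.md"):
        (Path(staged) / name).write_text("")
    print(discover_samples(staged))  # ['sample_a', 'sample_b']
```

In a Snakefile, a list like this could then be passed to expand(...) to build the target files. Running the same listing against the remote identifier would fail, since no such directory exists locally.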