Shared Storage
Overview
Nextflow workflows require a shared, POSIX-compliant file system that is accessible to all workflow tasks. All input files are downloaded and staged into a “workdir,” a directory on a mounted shared filesystem. Workflow tasks then read these files as part of their computation and write intermediate or output files back to the workdir.
A shared filesystem is required because workflow inputs can be extremely large (multiple terabytes of data) and tasks must share files even when they are scheduled on different nodes in the cluster. It also lets tasks write effectively unlimited amounts of data without requesting storage resources up front.
Latch provides two options for shared storage when running Nextflow workflows: EFS and ObjectiveFS.
EFS
EFS is an AWS-managed shared file system with nearly unlimited storage capacity and high throughput. It is mounted into every Nextflow process and can be accessed like any other directory.
EFS scales automatically as storage needs grow, so it supports both small and large workloads without performance degradation. It also provides strong data consistency and file locking, which Nextflow requires of its shared file systems.
OFS
ObjectiveFS (OFS) is a serverless shared filesystem that uses AWS S3 as its storage layer. Its architecture differs from EFS: file operations are processed directly on the host rather than on a set of central servers.
The OFS filesystem is POSIX-compliant and can scale up to 1 PB of data. OFS provides read-and-write consistency guarantees and the same durability and availability guarantees as AWS S3, while delivering high read and write performance and enforced data encryption.
Usage Examples
By default, all new workflows generated with `latch init` use OFS as the underlying storage. Follow the Nextflow Tutorial to generate a Nextflow project on Latch.
To configure your filesystem, edit the `initialize` method in the `wf/entrypoint.py` file. The `initialize` step of the workflow provisions a shared filesystem, and this is where you configure which filesystem your workflow uses.
- Point the provisioning request at `http://nf-dispatcher-service.flyte.svc.cluster.local/provision-storage-ofs` to provision OFS storage, or at `http://nf-dispatcher-service.flyte.svc.cluster.local/provision-storage-efs` to provision EFS storage.
- You can specify several parameters in the provisioning request to configure the shared filesystem for your workload (see the sketch after this list):
  - `storage_expiration_hours` specifies when to clean up the data in your storage. Set it to `0` to delete storage as soon as the execution completes, or to a non-zero value to keep storage for relaunching.
    - EFS storage expiration defaults to 0 hours.
    - OFS storage expiration defaults to 30 days.
  - `version` specifies the version of the Nextflow integration to use. It should be set to `2` for new workflows unless you are running a legacy workflow built on version `1` of the Nextflow integration.
  - `fs_size_tb` (OFS only) specifies the approximate expected filesystem size in TB. OFS requires the filesystem size ahead of time to provision more memory for each task. See “OFS task memory requirement” below for more details.
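For reference, here is a minimal sketch of what a configured `initialize` task in `wf/entrypoint.py` might look like. The dispatcher URL and request-body parameters come from the options above; the execution-token lookup, the `Authorization` header format, and the response handling are assumptions modeled on the code `latch init` generates and may differ in your file.

```python
import os

import requests
from latch.resources.tasks import small_task


@small_task
def initialize() -> str:
    # Assumed: Flyte exposes the execution ID, used here as an auth token.
    token = os.environ.get("FLYTE_INTERNAL_EXECUTION_ID")
    if token is None:
        raise RuntimeError("failed to get execution token")

    # Swap the path suffix to provision-storage-efs to use EFS instead.
    resp = requests.post(
        "http://nf-dispatcher-service.flyte.svc.cluster.local/provision-storage-ofs",
        headers={"Authorization": f"Latch-Execution-Token {token}"},
        json={
            "version": 2,  # Nextflow integration version; use 2 for new workflows
            "storage_expiration_hours": 720,  # keep data 30 days for relaunches
            "fs_size_tb": 1,  # OFS only: approximate filesystem size
        },
    )
    resp.raise_for_status()

    # Assumed response shape: the dispatcher returns the provisioned volume name.
    return resp.json()["name"]
```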
OFS task memory requirement
Unlike EFS, OFS is not an NFS filesystem and does not use an external server to process file requests. Instead, OFS runs as a FUSE process on every node in the cluster and mounts the filesystem into every workflow task. OFS uses node memory to store the filesystem index and a local cache that speed up file operations, so each workflow task must request extra memory to account for OFS memory usage.
The `fs_size_tb` parameter specifies the approximate storage size of your filesystem, which determines the additional memory requested for each workflow task. The memory requirements by filesystem size are:
| Filesystem Size (TB) | Additional Task Memory Request (GB) |
|---|---|
| 0-1 | 2 |
| 2-10 | 3 |
| 11-20 | 4 |
| 21-30 | 5 |
| 31-40 | 6 |
| 41-50 | 7 |
You can approximate the size of your filesystem by adding up all input file sizes and multiplying by 2 to account for intermediate files.
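As a quick illustration of this sizing rule, the hypothetical helper below (not part of the Latch SDK) estimates `fs_size_tb` from input file sizes and looks up the corresponding extra task memory from the table above.

```python
import math

# Additional task memory (GB) by filesystem size (TB), from the table above.
MEMORY_BY_FS_SIZE_TB = [(1, 2), (10, 3), (20, 4), (30, 5), (40, 6), (50, 7)]


def estimate_fs_size_tb(input_sizes_bytes: list[int]) -> int:
    """Sum all input sizes and multiply by 2 to account for intermediate files."""
    total_tb = 2 * sum(input_sizes_bytes) / 1e12
    return max(1, math.ceil(total_tb))


def additional_task_memory_gb(fs_size_tb: int) -> int:
    for max_size_tb, memory_gb in MEMORY_BY_FS_SIZE_TB:
        if fs_size_tb <= max_size_tb:
            return memory_gb
    raise ValueError("filesystem sizes above 50 TB are not covered by the table")


# Example: 3 TB of inputs -> fs_size_tb = 6 -> 3 GB extra memory per task.
fs_size = estimate_fs_size_tb([3 * 10**12])
print(fs_size, additional_task_memory_gb(fs_size))
```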
Comparison
Cost
The main difference between EFS and OFS lies in their cost models and underlying storage layers.
The EFS pricing model includes charges for data storage and data access. OFS uses S3 as its storage layer, so its pricing model includes charges for mounting OFS filesystems, S3 storage, and the additional RAM provisioned for tasks.
| | EFS | ObjectiveFS |
|---|---|---|
| Storage ($/GB/month) | 0.30 | 0.023 |
| Throughput Reads ($/GB/month) | 0.03 | N/A |
| Throughput Writes ($/GB/month) | 0.06 | N/A |
| Mount Cost ($/mount/hr) | N/A | 0.18 |
| RAM Cost ($/GiB/hr) | N/A | 0.009972 |
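As a rough, purely illustrative comparison (ignoring S3 request costs, storage tiering, and real access patterns), you can plug the rates above into a back-of-envelope estimate:

```python
# Illustrative only: rates taken from the pricing table above.
def efs_monthly_cost(storage_gb: float, read_gb: float, write_gb: float) -> float:
    return storage_gb * 0.30 + read_gb * 0.03 + write_gb * 0.06


def ofs_monthly_cost(storage_gb: float, mount_hours: float, ram_gib_hours: float) -> float:
    return storage_gb * 0.023 + mount_hours * 0.18 + ram_gib_hours * 0.009972


# Hypothetical workload: 1 TB stored for a month, 2 TB read, 1 TB written (EFS)
# vs. one mount running 24 hours with 3 GiB of extra RAM per task (OFS).
print(f"EFS: ${efs_monthly_cost(1000, 2000, 1000):.2f}")  # $420.00
print(f"OFS: ${ofs_monthly_cost(1000, 24, 3 * 24):.2f}")  # ~$28.04
```

The actual balance depends heavily on how much data your workload stores versus how much it reads and writes.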
Performance
OFS performs better overall in single-mount benchmarks. However, EFS throughput is higher in a Nextflow environment with many tasks reading and writing to the same file system, due to slow distributed file locking in OFS. Both systems perform well on common Nextflow workloads.
| | EFS | ObjectiveFS |
|---|---|---|
| Sequential Read (MB/s) | 92.55 | 122.23 |
| Sequential Write (MB/s) | 124.20 | 125.07 |
| Random Read (MB/s) | 58.39 | 77.79 |
| Random Write (MB/s) | 73.69 | 87.90 |
| Staging to workdir from LData (MB/s) | 212.31 | 188.40 |
| Writing to LData from workdir (MB/s) | 246.64 | 208.74 |
Notes:
- For the sequential read/write benchmarks, we measured the time to copy a 1 GB file to and from the file system.
- For the random read/write benchmarks, we measured the time to copy 1 GB in randomly chosen 1 MB chunks written to and read from the file system.
- For the staging benchmarks, we measured the total time to transfer a file between LData and the filesystem in each direction.
- OFS benchmarks were performed with a pre-warmed cache.
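As a sketch of the sequential-write methodology (not the exact harness used to produce the numbers above), a measurement script could look like this; the mount path is hypothetical:

```python
import os
import time

CHUNK = 1024 * 1024  # 1 MB
TOTAL = 1024 * CHUNK  # 1 GB

path = "/path/to/shared-fs/bench.dat"  # hypothetical shared-filesystem mount
data = os.urandom(CHUNK)

# Write 1 GB sequentially to the shared filesystem and report throughput.
start = time.monotonic()
with open(path, "wb") as f:
    for _ in range(TOTAL // CHUNK):
        f.write(data)
    f.flush()
    os.fsync(f.fileno())  # ensure data actually reaches the filesystem
elapsed = time.monotonic() - start

print(f"sequential write: {TOTAL / CHUNK / elapsed:.2f} MB/s")
```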
Choosing Shared Storage Option
The right storage option depends on your workload and budget requirements. Here is a summary of the file system comparison:
| | EFS | ObjectiveFS |
|---|---|---|
| Cost | +++ | + |
| Performance | +++ | ++ |
| Execution time | + | ++ |
EFS:
Pros:
- Stable high throughput on Nextflow workloads
Cons:
- Expensive. The throughput and storage costs of EFS can be very significant depending on the input size and the workload.
- Storing data for relaunch can be expensive
OFS:
Pros:
- Lower cost due to using S3. Does not have throughput charges.
- Good performance with pre-warmed cache
Cons:
- Throughput can vary more on Nextflow workloads
- Requires all tasks to provision extra memory
Summary
For most workloads, OFS is the better, more cost-effective option. It performs well and makes experiments cheaper to run. Executions are usually bound by the CPU processing time of the inputs, so the reduced file system throughput in a Nextflow environment does not significantly impact workflow performance for most workloads.