Publishing Datasets

After you complete the curation pipeline, use the publish commands to build metadata, upload the dataset to Latch Data, and notify the paper's authors.

Prerequisites

Before publishing, ensure you have:
  1. Completed the curation pipeline through harmonize-metadata
  2. Configuration files in ~/.latch/latch-curate/:
    • metadata_schema.yaml - Metadata harmonization schema
    • cell_typing_schema.yaml - Cell typing vocabulary
  3. Latch credentials in ~/.latch/:
    • token - Your Latch SDK token
    • workspace - Workspace ID (JSON or plaintext)

Setting Up Credentials

mkdir -p ~/.latch

# Token is automatically created when you run `latch login`
# Or manually create it:
echo "your-sdk-token" > ~/.latch/token

# Workspace ID (get from Latch Console settings)
echo "your-workspace-id" > ~/.latch/workspace
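Because the workspace file may hold either JSON or plaintext, a script that reads it has to handle both. The sketch below is one way to do that. It is not how latch-curate itself parses the file: the JSON layout is not documented here, so the `workspace_id` key is a hypothetical name chosen for illustration.

```python
import json


def parse_workspace_id(raw: str, json_key: str = "workspace_id") -> str:
    """Return the workspace ID from the contents of ~/.latch/workspace.

    Assumption: a JSON file is either a bare value (1234 or "1234") or an
    object that stores the ID under `json_key` (a hypothetical key name).
    Anything that is not valid JSON is treated as a plaintext ID.
    """
    text = raw.strip()
    if not text:
        raise ValueError("workspace file is empty")
    try:
        value = json.loads(text)
    except json.JSONDecodeError:
        return text  # plaintext ID, e.g. written with `echo`
    if isinstance(value, dict):
        if json_key not in value:
            raise ValueError(f"no {json_key!r} key in workspace JSON")
        return str(value[json_key])
    return str(value)
```

To use it, pass in the file contents, for example `parse_workspace_id(Path.home().joinpath(".latch", "workspace").read_text())`.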

Setting Up Configuration Files

mkdir -p ~/.latch/latch-curate

# Copy the cell typing schema from the repo
cp cell_typing_schema.yaml ~/.latch/latch-curate/

# Create metadata schema (see Configuration Reference for format)
cat > ~/.latch/latch-curate/metadata_schema.yaml << 'EOF'
variables:
  - name: "disease"
    description: "disease or condition studied"
    vocab:
      type: "ontology"
      name: "mondo"
  - name: "tissue"
    description: "tissue or anatomical site"
    vocab:
      type: "ontology"
      name: "uberon"
  - name: "assay"
    description: "sequencing assay used"
    vocab:
      type: "ontology"
      name: "efo"
  - name: "sample_site"
    description: "sample collection site"
    vocab:
      type: "custom"
      values: ["tumor", "normal", "metastasis", "blood"]
EOF
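A quick structural check of the schema can catch typos before you run the build. The sketch below works on the schema after it has been parsed into a Python dict (for instance with a YAML library, which is not shown). The rules are inferred only from the example above, so treat them as assumptions; the Configuration Reference remains the authority on the format.

```python
def check_metadata_schema(schema: dict) -> list[str]:
    """Return a list of problems found in a parsed metadata schema.

    Assumed rules, inferred from the example: every variable has `name`,
    `description`, and `vocab`; an "ontology" vocab needs `name`; a
    "custom" vocab needs a non-empty `values` list.
    """
    variables = schema.get("variables")
    if not isinstance(variables, list) or not variables:
        return ["'variables' must be a non-empty list"]

    problems = []
    for i, var in enumerate(variables):
        label = var.get("name", f"variables[{i}]")
        for key in ("name", "description", "vocab"):
            if key not in var:
                problems.append(f"{label}: missing '{key}'")
        vocab = var.get("vocab", {})
        kind = vocab.get("type")
        if kind == "ontology":
            if not vocab.get("name"):
                problems.append(f"{label}: ontology vocab needs 'name'")
        elif kind == "custom":
            if not vocab.get("values"):
                problems.append(f"{label}: custom vocab needs 'values'")
        elif "vocab" in var:
            problems.append(f"{label}: unknown vocab type {kind!r}")
    return problems


example = {
    "variables": [
        {"name": "disease", "description": "disease or condition studied",
         "vocab": {"type": "ontology", "name": "mondo"}},
        {"name": "sample_site", "description": "sample collection site",
         "vocab": {"type": "custom", "values": ["tumor", "normal"]}},
    ]
}
print(check_metadata_schema(example))
```

An empty list means the schema matches the assumed shape.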

Publish Workflow

Step 1: Build

Generate metadata and validate the curated dataset.
latch-curate publish build
This command:
  • Extracts paper title and abstract via API
  • Retrieves corresponding author contact information
  • Validates harmonized metadata against your schema
  • Validates cell typing against configured vocabulary
  • Extracts ontology tags (disease, tissue, assay, cell types)
  • Generates publish/build.yaml with all metadata
Required files:
  • download/paper_text.txt - Paper text or abstract
  • download/paper_url.txt - URL to the paper
  • download/external_id.txt - GEO accession ID
  • harmonize_metadata/harmonize_metadata.h5ad - Curated AnnData
Outputs:
  • publish/build.yaml - Build metadata file
  • publish/publish.h5ad - Final curated object
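Since missing inputs surface as assertion errors, it can help to check for them up front. The following is a minimal pre-flight sketch that looks for the files listed under Prerequisites and Required files. The function name and parameters are hypothetical and not part of latch-curate.

```python
from pathlib import Path

# Paths relative to ~/.latch, taken from the Prerequisites list.
CONFIG_FILES = [
    "latch-curate/metadata_schema.yaml",
    "latch-curate/cell_typing_schema.yaml",
    "token",
    "workspace",
]

# Paths relative to the dataset directory, taken from "Required files".
PIPELINE_FILES = [
    "download/paper_text.txt",
    "download/paper_url.txt",
    "download/external_id.txt",
    "harmonize_metadata/harmonize_metadata.h5ad",
]


def missing_build_inputs(workdir, latch_dir=None) -> list[str]:
    """Return the expected files that do not exist yet."""
    workdir = Path(workdir)
    latch_dir = Path(latch_dir) if latch_dir is not None else Path.home() / ".latch"
    candidates = [latch_dir / p for p in CONFIG_FILES]
    candidates += [workdir / p for p in PIPELINE_FILES]
    return [str(p) for p in candidates if not p.is_file()]
```

Call `missing_build_inputs(".")` from the dataset directory; an empty list means every listed file is present.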
Example output:
Build complete! Please verify the following information:
============================================================
Paper Title: Single-cell analysis of human tissues
Paper Abstract: We performed single-cell RNA sequencing...
Cell Count: 45,231
Authors: Smith J, Jones A
Email Contacts: [email protected]
Metadata Validation Status: passed
Metadata Tags Extracted: 4
Cell Typing Validation Status: passed
Cell Typing Tags Extracted: 8
All Tags:
  - disease: Alzheimer's disease
  - tissue: brain
  - assay: 10x 3' v3
  - cell_type: neuron
  - cell_type: astrocyte
  ... and 7 more
============================================================

Step 2: Upload

Upload the dataset to Latch Data and register it in the data portal.
latch-curate publish upload
You will be prompted for:
  • Destination path: Where to store the dataset in Latch Data (e.g., latch:///datasets/)
  • Curator organization ID: Your organization’s ID in the system
  • Dataset version: Version string (e.g., v1.0.0)
  • Curator dataset ID: Unique identifier for this dataset (defaults to GEO ID)
Or provide options directly:
latch-curate publish upload \
  --latch-dest "latch:///curated-datasets/" \
  --curator-id 123 \
  --version "v1.0.0" \
  --curator-dataset-id "GSE252545"
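When publishing several datasets from a script, the non-interactive form is easier to drive. The sketch below only assembles the argument list from the four options shown above; the helper name is hypothetical, and the values are the same placeholders used in the example.

```python
import shlex


def upload_command(latch_dest: str, curator_id: int, version: str,
                   curator_dataset_id: str) -> list[str]:
    """Build the argv list for a non-interactive `publish upload` call."""
    return [
        "latch-curate", "publish", "upload",
        "--latch-dest", latch_dest,
        "--curator-id", str(curator_id),
        "--version", version,
        "--curator-dataset-id", curator_dataset_id,
    ]


cmd = upload_command("latch:///curated-datasets/", 123, "v1.0.0", "GSE252545")
print(shlex.join(cmd))  # shell-quoted form, handy for logging
# To actually run the upload: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a destination path contains spaces.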
What happens:
  1. Uploads publish/ directory to Latch Data
  2. Retrieves the ldata node ID for the uploaded files
  3. Registers the dataset with the data portal API
  4. Returns family ID and dataset ID on success
Example output:
Dataset Upload
Paper Title: Single-cell analysis of human tissues
Cell Count: 45,231
Validation Status: passed
Tags: 12 extracted
Uploading dataset...
Curator ID: 123
Version: v1.0.0
Dataset ID: GSE252545
Retrieved node ID 456789 for latch:///curated-datasets/GSE252545
Uploading dataset to active workspace 456789
Upload complete!
Family ID: 100
Dataset ID: 200

Step 3: Email (Optional)

Send notification emails to paper authors about the curated dataset.
latch-curate publish email
Prerequisites:
  • Email configuration at ~/.latch/latch-curate/email-info.json:
{
  "smtp_host": "smtp.example.com",
  "smtp_port": 587,
  "smtp_user": "[email protected]",
  "smtp_password": "your-password",
  "sender_addr": "[email protected]",
  "starttls": true,
  "timeout": 30
}
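Before running the email step, you may want to confirm that the configuration file parses and that the SMTP settings work. The sketch below uses only the Python standard library. It assumes that all seven keys in the example are required and that the fields map onto an SMTP session in the obvious way; it is not a description of how latch-curate sends mail, and both function names are hypothetical.

```python
import json
import smtplib
from email.message import EmailMessage

# Assumption: every key in the example config is required.
EXPECTED_TYPES = {
    "smtp_host": str,
    "smtp_port": int,
    "smtp_user": str,
    "smtp_password": str,
    "sender_addr": str,
    "starttls": bool,
    "timeout": (int, float),
}


def load_email_config(text: str) -> dict:
    """Parse email-info.json contents and check key presence and types."""
    cfg = json.loads(text)
    for key, expected in EXPECTED_TYPES.items():
        if key not in cfg:
            raise KeyError(f"email-info.json is missing {key!r}")
        if not isinstance(cfg[key], expected):
            raise TypeError(f"{key!r} has unexpected type {type(cfg[key]).__name__}")
    return cfg


def send_test_message(cfg: dict, to_addr: str) -> None:
    """Send a one-line message to verify the SMTP settings (network call)."""
    msg = EmailMessage()
    msg["From"] = cfg["sender_addr"]
    msg["To"] = to_addr
    msg["Subject"] = "latch-curate SMTP check"
    msg.set_content("The SMTP settings in email-info.json work.")
    with smtplib.SMTP(cfg["smtp_host"], cfg["smtp_port"], timeout=cfg["timeout"]) as server:
        if cfg["starttls"]:
            server.starttls()  # upgrade the connection before logging in
        server.login(cfg["smtp_user"], cfg["smtp_password"])
        server.send_message(msg)
```

Sending the test message to your own address first avoids contacting authors with a misconfigured sender.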

Troubleshooting

Missing configuration files

AssertionError (metadata_schema_path or cell_typing_config_path)
Ensure configuration files exist at ~/.latch/latch-curate/. See Configuration Reference for schema formats.

Missing pipeline files

AssertionError (paper_url_file, paper_text_file, etc.)
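Run the full curation pipeline first, or create the required text files manually. The harmonize_metadata/harmonize_metadata.h5ad file cannot be stubbed this way; it must come from the harmonize-metadata step: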
mkdir -p download harmonize_metadata
echo "https://example.com/paper" > download/paper_url.txt
echo "Paper text here..." > download/paper_text.txt
echo "GSE12345" > download/external_id.txt

Cell typing validation failed

Cell typing validation failed: ["Cell type 'unknown' not in configured vocabulary"]
Add the missing cell type to ~/.latch/latch-curate/cell_typing_schema.yaml:
vocabulary:
  - name: "unknown"
    ontology_id: ""
  # ... other entries
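To find every label that would fail before running the build, you can compare the cell types in your data against the vocabulary yourself. This is a minimal sketch of that comparison, assuming the vocabulary has been parsed into a list of entries with a `name` field as in the snippet above. The function name is hypothetical, and the message format simply mirrors the error shown.

```python
def unknown_cell_types(observed, vocabulary) -> list[str]:
    """Return one message per observed cell type missing from the vocabulary.

    `observed` is any iterable of cell type labels; `vocabulary` is the
    list of entries found under the `vocabulary:` key of the schema.
    """
    allowed = {entry["name"] for entry in vocabulary}
    return [
        f"Cell type {name!r} not in configured vocabulary"
        for name in sorted(set(observed) - allowed)
    ]


vocab = [{"name": "neuron", "ontology_id": "CL:0000540"}]
print(unknown_cell_types(["neuron", "unknown", "neuron"], vocab))
```

Each label is reported once, regardless of how many cells carry it.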

Token not found

ValueError: SDK token does not exist
Run latch login or manually create the token file:
echo "your-token" > ~/.latch/token

Workspace not configured

AssertionError (workspace_data_path)
Create the workspace file:
echo "your-workspace-id" > ~/.latch/workspace

build.yaml Reference

The build file contains all metadata for the dataset:
info:
  description: "Paper abstract text..."
  paper_title: "Single-cell analysis..."
  cell_count: 45231
  paper_url: "https://doi.org/..."
  data_url: "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE252545"
  data_external_id: "GSE252545"
  corresponding_author_names:
    - "John Smith"
    - "Jane Doe"
  corresponding_author_emails:
    - "[email protected]"
    - "[email protected]"

validation:
  metadata_validation_status: "passed"
  metadata_schema_used: "/root/.latch/latch-curate/metadata_schema.yaml"
  metadata_tags_extracted: 4
  cell_typing_validation_status: "passed"
  cell_typing_config_used: "/root/.latch/latch-curate/cell_typing_schema.yaml"
  cell_typing_tags_extracted: 8

tags:
  - metadata_type: "disease"
    value: "Alzheimer's disease"
    ontology_id: "MONDO:0004975"
  - metadata_type: "tissue"
    value: "brain"
    ontology_id: "UBERON:0000955"
  - metadata_type: "cell_type"
    value: "neuron"
    ontology_id: "CL:0000540"

curator:
  curator_id: 123
  version: "v1.0.0"
  curator_dataset_id: "GSE252545"
  upload_timestamp: "2024-01-15T10:30:00"
  ldata_node_id: 456789
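When reviewing a build file, it is often convenient to see the tags grouped by type. The sketch below does this for a build file that has already been parsed into a Python dict (YAML loading is not shown). It relies only on the `tags` fields in the reference above and uses a hypothetical helper name.

```python
from collections import defaultdict


def tags_by_type(build: dict) -> dict:
    """Group (value, ontology_id) pairs by their metadata_type."""
    grouped = defaultdict(list)
    for tag in build.get("tags", []):
        # ontology_id is read with .get() in case some tags omit it.
        grouped[tag["metadata_type"]].append((tag["value"], tag.get("ontology_id")))
    return dict(grouped)


build = {
    "tags": [
        {"metadata_type": "disease", "value": "Alzheimer's disease",
         "ontology_id": "MONDO:0004975"},
        {"metadata_type": "tissue", "value": "brain",
         "ontology_id": "UBERON:0000955"},
        {"metadata_type": "cell_type", "value": "neuron",
         "ontology_id": "CL:0000540"},
    ]
}
for kind, entries in tags_by_type(build).items():
    print(kind, entries)
```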