# Publishing Datasets

> Upload curated datasets to the Latch Data Portal

After you complete the curation pipeline, use the publish commands to build metadata, upload the dataset to Latch Data, and notify the paper's authors.

## Prerequisites

Before publishing, ensure you have:

1. **Completed the curation pipeline** through `harmonize-metadata`
2. **Configuration files** in `~/.latch/latch-curate/`:
   * `metadata_schema.yaml` - Metadata harmonization schema
   * `cell_typing_schema.yaml` - Cell typing vocabulary
3. **Latch credentials** in `~/.latch/`:
   * `token` - Your Latch SDK token
   * `workspace` - Workspace ID (JSON or plaintext)

### Setting Up Credentials

```bash theme={null}
mkdir -p ~/.latch

# Token is automatically created when you run `latch login`
# Or manually create it:
echo "your-sdk-token" > ~/.latch/token

# Workspace ID (get from Latch Console settings)
echo "your-workspace-id" > ~/.latch/workspace
```

### Setting Up Configuration Files

```bash theme={null}
mkdir -p ~/.latch/latch-curate

# Copy the cell typing schema from the repo
cp cell_typing_schema.yaml ~/.latch/latch-curate/

# Create metadata schema (see Configuration Reference for format)
cat > ~/.latch/latch-curate/metadata_schema.yaml << 'EOF'
variables:
  - name: "disease"
    description: "disease or condition studied"
    vocab:
      type: "ontology"
      name: "mondo"
  - name: "tissue"
    description: "tissue or anatomical site"
    vocab:
      type: "ontology"
      name: "uberon"
  - name: "assay"
    description: "sequencing assay used"
    vocab:
      type: "ontology"
      name: "efo"
  - name: "sample_site"
    description: "sample collection site"
    vocab:
      type: "custom"
      values: ["tumor", "normal", "metastasis", "blood"]
EOF
```
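
The prerequisites above can be verified with a short preflight check before running any publish command. The sketch below only tests that each file exists and is non-empty; it does not validate the contents. The function name `check_publish_prereqs` and its optional directory argument are illustrative and not part of `latch-curate`.

```bash
# Hypothetical helper (not part of latch-curate): report missing prerequisite files.
# Usage: check_publish_prereqs [latch_home]   (defaults to ~/.latch)
check_publish_prereqs() {
  local latch_home="${1:-$HOME/.latch}"
  local rc=0
  local f
  for f in \
    "$latch_home/token" \
    "$latch_home/workspace" \
    "$latch_home/latch-curate/metadata_schema.yaml" \
    "$latch_home/latch-curate/cell_typing_schema.yaml"; do
    # -s is true only when the file exists and has a size greater than zero
    if [ ! -s "$f" ]; then
      echo "missing or empty: $f" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

Running `check_publish_prereqs && latch-curate publish build` would then start the build only when all four files are present.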

## Publish Workflow

### Step 1: Build

Generate metadata and validate the curated dataset.

```bash theme={null}
latch-curate publish build
```

This command:

* Extracts paper title and abstract via API
* Retrieves corresponding author contact information
* Validates harmonized metadata against your schema
* Validates cell typing against configured vocabulary
* Extracts ontology tags (disease, tissue, assay, cell types)
* Generates `publish/build.yaml` with all metadata

**Required files:**

* `download/paper_text.txt` - Paper text or abstract
* `download/paper_url.txt` - URL to the paper
* `download/external_id.txt` - GEO accession ID
* `harmonize_metadata/harmonize_metadata.h5ad` - Curated AnnData

**Outputs:**

* `publish/build.yaml` - Build metadata file
* `publish/publish.h5ad` - Final curated object
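
Before running the build, the required input files listed above can be checked from the dataset's working directory. This is a minimal sketch that assumes the paths are relative to that directory, as they are written in this guide; `check_build_inputs` is a hypothetical name, not a `latch-curate` command.

```bash
# Hypothetical helper (not part of latch-curate): check the build inputs.
# Usage: check_build_inputs [dataset_dir]   (defaults to the current directory)
check_build_inputs() {
  local dir="${1:-.}"
  local rc=0
  local f
  for f in \
    download/paper_text.txt \
    download/paper_url.txt \
    download/external_id.txt \
    harmonize_metadata/harmonize_metadata.h5ad; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

A nonzero return code means at least one input is absent; the names of the missing files are written to standard error.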

**Example output:**

```
Build complete! Please verify the following information:
============================================================
Paper Title: Single-cell analysis of human tissues
Paper Abstract: We performed single-cell RNA sequencing...
Cell Count: 45,231
Authors: Smith J, Jones A
Email Contacts: smith@university.edu
Metadata Validation Status: passed
Metadata Tags Extracted: 4
Cell Typing Validation Status: passed
Cell Typing Tags Extracted: 8
All Tags:
  - disease: Alzheimer's disease
  - tissue: brain
  - assay: 10x 3' v3
  - cell_type: neuron
  - cell_type: astrocyte
  ... and 6 more
============================================================
```

### Step 2: Upload

Upload the dataset to Latch Data and register it in the data portal.

```bash theme={null}
latch-curate publish upload
```

You will be prompted for:

* **Destination path**: Where to store the dataset in Latch Data (e.g., `latch:///datasets/`)
* **Curator organization ID**: Your organization's ID in the system
* **Dataset version**: Version string (e.g., `v1.0.0`)
* **Curator dataset ID**: Unique identifier for this dataset (defaults to GEO ID)

**Or provide options directly:**

```bash theme={null}
latch-curate publish upload \
  --latch-dest "latch:///curated-datasets/" \
  --curator-id 123 \
  --version "v1.0.0" \
  --curator-dataset-id "GSE252545"
```

**What happens:**

1. Uploads the `publish/` directory to Latch Data
2. Retrieves the ldata node ID for the uploaded files
3. Registers the dataset with the data portal API
4. Returns the family ID and dataset ID on success

**Example output:**

```
Dataset Upload
Paper Title: Single-cell analysis of human tissues
Cell Count: 45,231
Validation Status: passed
Tags: 12 extracted
Uploading dataset...
Curator ID: 123
Version: v1.0.0
Dataset ID: GSE252545
Retrieved node ID 456789 for latch:///curated-datasets/GSE252545
Uploading dataset to active workspace 456789
Upload complete!
Family ID: 100
Dataset ID: 200
```
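
For unattended runs, the build and upload steps can be chained in one function. The sketch below uses only the flags shown above. It assumes each command exits with a nonzero status on failure, which this guide does not confirm, and `LATCH_CURATE` is a hypothetical variable introduced here so the function can be dry-run; it is not a `latch-curate` setting.

```bash
# Sketch of an unattended build-then-upload run.
# LATCH_CURATE is a hypothetical override for the executable name
# (e.g. LATCH_CURATE=echo for a dry run).
# Assumption: each command exits nonzero on failure, so `&&` stops the chain.
publish_dataset() {
  if [ "$#" -ne 4 ]; then
    echo "usage: publish_dataset DEST CURATOR_ID VERSION DATASET_ID" >&2
    return 2
  fi
  local dest="$1" curator_id="$2" version="$3" dataset_id="$4"
  local cli="${LATCH_CURATE:-latch-curate}"

  "$cli" publish build &&
    "$cli" publish upload \
      --latch-dest "$dest" \
      --curator-id "$curator_id" \
      --version "$version" \
      --curator-dataset-id "$dataset_id"
}
```

A dry run such as `LATCH_CURATE=echo publish_dataset "latch:///curated-datasets/" 123 v1.0.0 GSE252545` prints the two commands instead of executing them. Because the build output is meant to be reviewed, a chained run like this is best suited to datasets whose metadata has already been checked.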

### Step 3: Email (Optional)

Send notification emails to paper authors about the curated dataset.

```bash theme={null}
latch-curate publish email
```

**Prerequisites:**

* Email configuration at `~/.latch/latch-curate/email-info.json`:

```json theme={null}
{
  "smtp_host": "smtp.example.com",
  "smtp_port": 587,
  "smtp_user": "your-email@example.com",
  "smtp_password": "your-password",
  "sender_addr": "curation@latch.bio",
  "starttls": true,
  "timeout": 30
}
```
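
A quick check that the configuration file contains each key from the example above can catch typos before any email is sent. The sketch below is a plain text match, not a JSON parse, and it treats the example as the full set of keys, which is an assumption; the source does not say which keys are required. `check_email_config` is a hypothetical helper.

```bash
# Hypothetical helper (not part of latch-curate): crude check that each key
# from the example config appears in the file. This is a text match only, so
# it will not catch JSON syntax errors or wrong value types.
check_email_config() {
  local file="${1:-$HOME/.latch/latch-curate/email-info.json}"
  local rc=0
  local key
  if [ ! -f "$file" ]; then
    echo "not found: $file" >&2
    return 1
  fi
  for key in smtp_host smtp_port smtp_user smtp_password sender_addr starttls timeout; do
    # Match the quoted key followed by optional whitespace and a colon
    if ! grep -q "\"$key\"[[:space:]]*:" "$file"; then
      echo "missing key: $key" >&2
      rc=1
    fi
  done
  return "$rc"
}
```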

## Troubleshooting

### Missing configuration files

```
AssertionError (metadata_schema_path or cell_typing_config_path)
```

Ensure configuration files exist at `~/.latch/latch-curate/`. See [Configuration Reference](/curate/configuration) for schema formats.

### Missing pipeline files

```
AssertionError (paper_url_file, paper_text_file, etc.)
```

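Run the full curation pipeline first, or create the required text files manually. The commands below do not create `harmonize_metadata/harmonize_metadata.h5ad`, which must still come from the `harmonize-metadata` step: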

```bash theme={null}
mkdir -p download harmonize_metadata
echo "https://example.com/paper" > download/paper_url.txt
echo "Paper text here..." > download/paper_text.txt
echo "GSE12345" > download/external_id.txt
```

### Cell typing validation failed

```
Cell typing validation failed: ["Cell type 'unknown' not in configured vocabulary"]
```

Add the missing cell type to `~/.latch/latch-curate/cell_typing_schema.yaml`:

```yaml theme={null}
vocabulary:
  - name: "unknown"
    ontology_id: ""
  # ... other entries
```

### Token not found

```
ValueError: SDK token does not exist
```

Run `latch login` or manually create the token file:

```bash theme={null}
echo "your-token" > ~/.latch/token
```

### Workspace not configured

```
AssertionError (workspace_data_path)
```

Create the workspace file:

```bash theme={null}
echo "your-workspace-id" > ~/.latch/workspace
```

## build.yaml Reference

The build file contains all metadata for the dataset:

```yaml theme={null}
info:
  description: "Paper abstract text..."
  paper_title: "Single-cell analysis..."
  cell_count: 45231
  paper_url: "https://doi.org/..."
  data_url: "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE252545"
  data_external_id: "GSE252545"
  corresponding_author_names:
    - "John Smith"
    - "Jane Doe"
  corresponding_author_emails:
    - "smith@university.edu"
    - "doe@institute.org"

validation:
  metadata_validation_status: "passed"
  metadata_schema_used: "/root/.latch/latch-curate/metadata_schema.yaml"
  metadata_tags_extracted: 4
  cell_typing_validation_status: "passed"
  cell_typing_config_used: "/root/.latch/latch-curate/cell_typing_schema.yaml"
  cell_typing_tags_extracted: 8

tags:
  - metadata_type: "disease"
    value: "Alzheimer's disease"
    ontology_id: "MONDO:0004975"
  - metadata_type: "tissue"
    value: "brain"
    ontology_id: "UBERON:0000955"
  - metadata_type: "cell_type"
    value: "neuron"
    ontology_id: "CL:0000540"

curator:
  curator_id: 123
  version: "v1.0.0"
  curator_dataset_id: "GSE252545"
  upload_timestamp: "2024-01-15T10:30:00"
  ldata_node_id: 456789
```
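
The `validation` fields in the reference above can be read from a script, for example to stop before an upload when validation did not pass. This sketch assumes the file is laid out as shown, with each status on its own line, and uses `grep` rather than a YAML parser; `build_validated` is a hypothetical helper, not part of `latch-curate`.

```bash
# Hypothetical helper (not part of latch-curate): succeed only if both
# validation statuses in build.yaml are "passed".
# Assumption: one `<name>_validation_status: "<value>"` entry per line,
# as in the reference above.
build_validated() {
  local file="${1:-publish/build.yaml}"
  local key value
  for key in metadata_validation_status cell_typing_validation_status; do
    # Take the text after the first colon, then strip spaces and quotes
    value="$(grep -E "^[[:space:]]*$key:" "$file" | head -n 1 | cut -d: -f2- | tr -d " \"'")"
    if [ "$value" != "passed" ]; then
      echo "$key is '${value:-unset}', expected 'passed'" >&2
      return 1
    fi
  done
}
```

For instance, `build_validated && latch-curate publish upload` would skip the upload when either status is missing or not `passed`.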
