# Latch Curate v0.2.0

> An agentic Python framework to curate public single-cell data

Latch Curate is an agentic Python framework that streamlines the curation of public single-cell data, turning a GEO accession ID into a standardized, analysis-ready object.

## Overview

The curation lifecycle consists of seven steps, each with a dedicated CLI command. At the end of each step, a human validates the result using a step-specific report, so data-quality problems and errors are caught early in the process.
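
The step order can be sketched as a small Python helper that lists the main command for each step. This is only an illustration assembled from the commands on this page: the helper name `curation_commands` is hypothetical, and it prints the commands instead of running them, because each step needs a human review before the next one starts.

```python
from typing import List


def curation_commands(gse_id: str) -> List[List[str]]:
    """Return the main command for each of the seven steps, in order."""
    return [
        ["latch-curate", "download", "run", "--gse-id", gse_id],
        ["latch-curate", "construct-counts", "run"],
        ["latch-curate", "qc", "run"],
        ["latch-curate", "transform", "run"],
        ["latch-curate", "type-cells", "run"],
        ["latch-curate", "harmonize-metadata", "run"],
        # Step 7 starts with `publish build`; `publish upload` follows it.
        ["latch-curate", "publish", "build"],
    ]


# Print a numbered checklist for one accession ID.
for number, argv in enumerate(curation_commands("GSE252545"), start=1):
    print(f"{number}. {' '.join(argv)}")
```

Treat the printed list as a checklist: run one command, inspect its report, then move to the next.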

## Prerequisites

Before beginning:

* Create a fresh, dedicated directory for your curation project (ideally named after your dataset)
* Each dataset requires its own directory
* Ensure you have the GEO accession ID for your dataset

## Curation Steps

### 1. Download

Download metadata and supplementary files from GEO.

```bash theme={null}
latch-curate download run --gse-id GSE252545
```

**Required manual steps:**

* Copy and paste the paper text into `download/paper_text.txt`
  * If the paper is unavailable or behind a paywall, use the abstract only
  * If no abstract exists, copy the entire text from the GEO accession page
* Copy the paper URL to `download/paper_url.txt`

**Outputs:**

* GSE and SRP metadata → `download/study_metadata.txt`
* Supplementary files → `download/supp_data/`
* User-provided paper text → `download/paper_text.txt`
* User-provided paper URL → `download/paper_url.txt`
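
Before moving on, a short script can confirm that the four paths listed above exist. This is a minimal sketch and not part of latch-curate: the function name `missing_download_outputs` is hypothetical, and the paths are copied from the Outputs list.

```python
from pathlib import Path
from typing import List

# Paths relative to the project directory, as listed under Outputs.
EXPECTED_DOWNLOAD_PATHS = (
    "download/study_metadata.txt",
    "download/supp_data",
    "download/paper_text.txt",
    "download/paper_url.txt",
)


def missing_download_outputs(project_dir: Path) -> List[str]:
    """Return the expected download paths that do not exist yet."""
    return [
        rel for rel in EXPECTED_DOWNLOAD_PATHS
        if not (project_dir / rel).exists()
    ]
```

Calling `missing_download_outputs(Path("."))` from the project directory returns an empty list once the download and the two manual steps are complete.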

**Optional review step:**

```bash theme={null}
latch-curate download review
```

This checks whether the total data size fits within the model's context window. If the check fails, remove non-essential content (e.g., citations, dense methods sections) from `paper_text.txt` and `study_metadata.txt`.
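
When the review fails, it can help to see which text file is largest before trimming. The sketch below is an illustration under stated assumptions, not the check that `latch-curate download review` performs: it only counts characters, and the helper names `rank_by_size` and `load_texts` are hypothetical.

```python
from pathlib import Path
from typing import Dict, List, Tuple

# The two text files this page suggests trimming.
TEXT_FILES = ("paper_text.txt", "study_metadata.txt")


def rank_by_size(contents: Dict[str, str]) -> List[Tuple[str, int]]:
    """Return (name, character count) pairs, largest first."""
    sizes = [(name, len(text)) for name, text in contents.items()]
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)


def load_texts(download_dir: Path) -> Dict[str, str]:
    """Read whichever of the text files exist under download_dir."""
    return {
        name: (download_dir / name).read_text(encoding="utf-8", errors="replace")
        for name in TEXT_FILES
        if (download_dir / name).is_file()
    }
```

For example, `rank_by_size(load_texts(Path("download")))` shows which file to shorten first.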

### 2. Construct Counts

Build a counts matrix using an LLM with access to tools and a terminal. The agent iterates until all tests pass.

```bash theme={null}
latch-curate construct-counts run
```

**Outputs:**

* Counts matrix → `construct_counts/counts.h5ad`
* Report → `construct_counts/counts.html`

**Standardize existing h5ad files:**

```bash theme={null}
latch-curate construct-counts run --input-h5ad /path/to/your_data.h5ad
```

The `--input-h5ad` flag transforms an existing AnnData file to meet latch-curate standards. It:

* Maps gene symbols to Ensembl IDs
* Creates `obs['latch_sample_id']` from existing metadata
* Ensures `var['gene_symbols']` exists
* Prefixes author metadata with `author_`
* Validates counts are raw (non-negative integers)
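
The last check in the list, that counts are raw, amounts to testing that every value is a non-negative whole number. A minimal sketch of that test in plain Python follows. The function name `looks_like_raw_counts` is hypothetical, and this is not latch-curate's own validation code; on a real AnnData object the same test would be applied to the values of the counts matrix.

```python
from typing import Iterable


def looks_like_raw_counts(values: Iterable[float]) -> bool:
    """Return True if every value is a non-negative whole number.

    NaN and infinity fail the test; an empty input passes trivially.
    """
    return all(v >= 0 and float(v).is_integer() for v in values)
```

Normalized or log-transformed matrices usually contain fractional values, so they fail this test.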

**Optional chat review:**

```bash theme={null}
latch-curate construct-counts chat
```

Summarizes the steps taken and any errors encountered during construction.

### 3. Quality Control (QC)

Performs two-pass QC:

1. Conservative fixed filters (LLM-generated using paper and metadata)
2. Sample-based adaptive filters (using per-sample quantile tables)

```bash theme={null}
latch-curate qc run
```
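
The two-pass idea can be illustrated with a small stand-alone sketch. Everything specific here is an assumption made for illustration: the per-cell fields `n_genes` and `pct_mito`, the thresholds, the 99th-percentile cutoff, and the function names are hypothetical, and the filters latch-curate actually applies are the ones recorded in its QC report and parameters file.

```python
from statistics import quantiles
from typing import Any, Dict, List

Cell = Dict[str, Any]  # hypothetical per-cell record


def fixed_filter(cells: List[Cell], min_genes: int, max_pct_mito: float) -> List[Cell]:
    """Pass 1: apply the same conservative thresholds to every cell."""
    return [
        c for c in cells
        if c["n_genes"] >= min_genes and c["pct_mito"] <= max_pct_mito
    ]


def adaptive_filter(cells: List[Cell], percentile: int = 99) -> List[Cell]:
    """Pass 2: per sample, drop cells above that sample's n_genes percentile."""
    by_sample: Dict[Any, List[Cell]] = {}
    for c in cells:
        by_sample.setdefault(c["sample"], []).append(c)

    kept: List[Cell] = []
    for sample_cells in by_sample.values():
        values = [c["n_genes"] for c in sample_cells]
        if len(values) < 2:  # too few points to estimate a quantile
            kept.extend(sample_cells)
            continue
        # quantiles(n=100) returns 99 cut points; index p-1 is the p-th percentile.
        cutoff = quantiles(values, n=100, method="inclusive")[percentile - 1]
        kept.extend(c for c in sample_cells if c["n_genes"] <= cutoff)
    return kept
```

Because the second pass computes its cutoff separately for each sample, a sample with generally higher gene counts is not penalized by thresholds tuned for another sample.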

**Important:** Inspect the report before proceeding.

**Outputs:**

* Filtered object → `qc/qc.h5ad`
* Report → `qc/qc_report.html`
* QC parameters → `qc/qc_params.yaml`

**Modify QC parameters:**

```bash theme={null}
latch-curate qc run --use-params
```

Edit `qc/qc_params.yaml`, then pass `--use-params` to re-run QC with the edited values.

### 4. Transform

Runs a standard transformation pipeline:

* Normalization
* Log transformation
* PCA
* Batch integration
* Neighborhood computation
* Embeddings generation

```bash theme={null}
latch-curate transform run
```
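
For readers unfamiliar with the first two stages, normalization and log transformation, the sketch below shows the common convention for a single cell: scale the counts to a fixed total, then apply `log(1 + x)`. The target sum of 10,000 and the function name `normalize_and_log` are assumptions for illustration; this page does not state which parameters latch-curate uses.

```python
import math
from typing import List


def normalize_and_log(counts: List[float], target_sum: float = 10_000.0) -> List[float]:
    """Scale one cell's counts to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all zeros instead of dividing by zero.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * target_sum) for c in counts]
```

After this step, cells with different sequencing depths are on a comparable scale, which is what the later PCA and integration stages rely on.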

**Important:** Inspect the report before proceeding.

**Outputs:**

* Transformed matrix → `transform/transform.h5ad`
* Report → `transform/transform.html`

### 5. Type Cells

Computes differential gene expression between clusters and uses an LLM with a controlled vocabulary to annotate cell types.

```bash theme={null}
latch-curate type-cells run
```
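
A controlled vocabulary means every label must come from a fixed list of allowed terms. When reviewing or correcting annotations by hand, a check like the one below can flag labels that fall outside that list. This is a sketch under assumptions: the function name, the cluster-to-label mapping, and the example terms are hypothetical and do not reflect the actual vocabulary or file layout used by latch-curate.

```python
from typing import Dict, Set


def out_of_vocabulary(annotations: Dict[str, str], vocabulary: Set[str]) -> Dict[str, str]:
    """Return the cluster -> label pairs whose label is not an allowed term."""
    return {
        cluster: label
        for cluster, label in annotations.items()
        if label not in vocabulary
    }


# Hypothetical example: one label is a free-text variant of an allowed term.
allowed = {"T cell", "B cell", "monocyte"}
labels = {"0": "T cell", "1": "B-cells", "2": "monocyte"}
print(out_of_vocabulary(labels, allowed))
```

The comparison is an exact string match, so spelling variants such as plural forms are reported as out of vocabulary.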

**Important:** Inspect the report before proceeding.

**Outputs:**

* Cell-typed matrix → `type_cells/type_cells.h5ad`
* Report → `type_cells/type_cells.html`
* Annotations → `type_cells/type_cells_metadata.yaml`

**Modify cell type annotations:**

```bash theme={null}
latch-curate type-cells run --use-metadata
```

Edit `type_cells/type_cells_metadata.yaml`, then pass `--use-metadata` to re-run cell typing with the corrected annotations.

**Run on external AnnData files:**

```bash theme={null}
latch-curate type-cells run --adata-path /path/to/your_data.h5ad
```

Use the `--adata-path` flag to run cell typing on any AnnData object (e.g., from old projects or different assay types).

### 6. Harmonize Metadata

Uses an LLM with access to all downloaded information and ontology search tools to construct harmonized variables against controlled vocabularies.

**Harmonized variables (at sample resolution):**

* `latch_subject_id`
* `latch_condition`
* `latch_disease`
* `latch_tissue`
* `latch_sample_site`
* `latch_sequencing_platform`
* `latch_organism`

```bash theme={null}
latch-curate harmonize-metadata run
```
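
"At sample resolution" implies that every cell from the same sample carries the same value for each harmonized variable. The sketch below checks that property on plain per-cell records. It is an illustration only: the function name `inconsistent_samples` and the dict-based row format are hypothetical, while the variable names and `latch_sample_id` come from this page.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

HARMONIZED_VARIABLES = (
    "latch_subject_id",
    "latch_condition",
    "latch_disease",
    "latch_tissue",
    "latch_sample_site",
    "latch_sequencing_platform",
    "latch_organism",
)


def inconsistent_samples(rows: Iterable[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return (sample_id, variable) pairs with a missing or non-unique value."""
    seen: Dict[Tuple[str, str], Set[Optional[str]]] = {}
    for row in rows:
        sample = row["latch_sample_id"]
        for variable in HARMONIZED_VARIABLES:
            # A missing variable is recorded as None so it gets reported.
            seen.setdefault((sample, variable), set()).add(row.get(variable))
    return sorted(
        key for key, values in seen.items()
        if len(values) != 1 or None in values
    )
```

An empty result means each sample maps to exactly one value per variable, which matches the sample-to-variable mapping shown in the report.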

**Important:** Inspect the report for sample-to-variable mapping and reasoning.

**Outputs:**

* Harmonized object → `harmonize_metadata/harmonize_metadata.h5ad`
* Report → `harmonize_metadata/harmonize_metadata.html`
* Annotations → `harmonize_metadata/harmonize_metadata_metadata.yaml`

**Modify metadata annotations:**

```bash theme={null}
latch-curate harmonize-metadata run --use-metadata
```

Edit `harmonize_metadata/harmonize_metadata_metadata.yaml`, then pass `--use-metadata` to re-run harmonization with the corrected annotations.

**Run on external AnnData files:**

```bash theme={null}
latch-curate harmonize-metadata run --adata-path /path/to/your_data.h5ad
```

Use the `--adata-path` flag to run metadata harmonization on any AnnData object with an `obs['latch_sample_id']` column.

**Requirements:** The `download/` folder must still exist and contain `study_metadata.txt` and `paper_text.txt`.

### 7. Publish

Build metadata and upload the curated dataset to the Latch Data Portal.

```bash theme={null}
# Build metadata and validate
latch-curate publish build

# Upload to data portal
latch-curate publish upload

# (Optional) Send email to paper authors
latch-curate publish email
```

**Prerequisites:**

* Configuration files at `~/.latch/latch-curate/`:
  * `metadata_schema.yaml`
  * `cell_typing_schema.yaml`
* Latch credentials at `~/.latch/`:
  * `token` (from `latch login`)
  * `workspace`

**Outputs:**

* Build metadata with extracted tags → `publish/build.yaml`
* Final curated object → `publish/publish.h5ad`

See [Publishing Datasets](/curate/publish) for detailed setup instructions and troubleshooting.

**Additional validation (optional):**

1. **Run linting tests:**
   Use the [Lint Curated AnnData Workflow](https://console.latch.bio/workflows/109042) to validate object structure.

2. **Convert to Seurat:**
   Use the [AnnData To Seurat Conversion Workflow](https://console.latch.bio/workflows/109453) if Seurat format is required.

## Best Practices

* Always review reports before proceeding to the next step
* Keep the original downloaded data intact
* Document any manual corrections made during the process
* Use the `--use-params` flag (for QC) and the `--use-metadata` flag (for type-cells and harmonize-metadata) to iterate on automated decisions
* Maintain separate directories for each dataset curation

## Troubleshooting

* If the download review fails, reduce the size of text files by removing citations and detailed methods
* For construct-counts issues, use the `chat` command to understand what went wrong
* When QC seems too stringent or lenient, modify the parameters YAML and re-run with `--use-params`
* For incorrect cell type annotations, edit the metadata YAML and re-run with `--use-metadata`
* For incorrect metadata harmonization, edit the annotations YAML and re-run with `--use-metadata`
