Latch Curate v0.2.0

Latch Curate is an agentic Python framework designed to streamline the curation of public single-cell data from GEO accession IDs into standardized, analysis-ready objects.

Overview

The curation lifecycle consists of 7 concrete steps, each with a dedicated CLI command. Human validation is required at the end of each step through step-specific reports to ensure data quality and catch errors early in the process.

Prerequisites

Before beginning:
  • Create a fresh, dedicated directory for your curation project (ideally named after your dataset)
  • Each dataset requires its own directory
  • Ensure you have the GEO accession ID for your dataset
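The setup above can be scripted. The following is a minimal sketch, not part of latch-curate: the helper name init_project is hypothetical, and the only assumption about accession IDs is that GEO series accessions are "GSE" followed by digits.

```python
import re
from pathlib import Path

# GEO series accessions are "GSE" followed by digits, e.g. GSE252545.
GSE_PATTERN = re.compile(r"GSE\d+")


def init_project(gse_id: str, parent: Path) -> Path:
    """Create a fresh directory named after the accession and return it.

    Hypothetical helper for illustration; latch-curate does not ship it.
    """
    if not GSE_PATTERN.fullmatch(gse_id):
        raise ValueError(f"not a GEO series accession: {gse_id!r}")
    project = parent / gse_id
    # exist_ok=False: fail rather than reuse a directory from an earlier run.
    project.mkdir(parents=True, exist_ok=False)
    return project
```

Calling init_project("GSE252545", Path.cwd()) creates ./GSE252545 and raises FileExistsError if that directory is already present, which guards against mixing two curation runs.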

Curation Steps

1. Download

Download metadata and supplementary files from GEO.
latch-curate download run --gse-id GSE252545
Required manual steps:
  • Copy and paste the paper text into download/paper_text.txt
    • If the paper is unavailable or behind a paywall, use the abstract only
    • If no abstract exists, copy the entire text from the GEO accession page
  • Copy the paper URL to download/paper_url.txt
Outputs:
  • GSE and SRP metadata → download/study_metadata.txt
  • Supplementary files → download/supp_data/
  • User-provided paper text → download/paper_text.txt
  • User-provided paper URL → download/paper_url.txt
Optional review step:
latch-curate download review
This checks whether the total data size fits within the model’s context window. If the check fails, remove non-essential content (e.g., citations, dense methods sections) from download/paper_text.txt and download/study_metadata.txt.
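To see which file is worth trimming first, a rough size estimate can help. This sketch is independent of the review command and does not reproduce its logic. The 4-characters-per-token ratio is a common rule of thumb for English text, and it is an assumption here, not a value taken from latch-curate.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # assumption: rough heuristic for English text


def estimate_tokens(text: str) -> int:
    """Approximate token count from character count."""
    return len(text) // CHARS_PER_TOKEN


def report_sizes(download_dir: Path,
                 names=("paper_text.txt", "study_metadata.txt")) -> dict:
    """Estimated tokens per text file; files that do not exist are skipped."""
    sizes = {}
    for name in names:
        path = download_dir / name
        if path.exists():
            text = path.read_text(encoding="utf-8", errors="replace")
            sizes[name] = estimate_tokens(text)
    return sizes


if __name__ == "__main__":
    sizes = report_sizes(Path("download"))
    for name, tokens in sizes.items():
        print(f"{name}: ~{tokens} tokens")
    print(f"total: ~{sum(sizes.values())} tokens")
```

The numbers are only a guide for deciding what to cut; the review command remains the authority on whether the inputs fit.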

2. Construct Counts

Build a counts matrix using an LLM with access to tools and a terminal. The process continues until all tests pass.
latch-curate construct-counts run
Outputs:
  • Counts matrix → construct_counts/counts.h5ad
  • Report → construct_counts/counts.html
Optional chat review:
latch-curate construct-counts chat
Summarizes the steps taken and any errors encountered during construction.

3. Quality Control (QC)

Performs two-pass QC:
  1. Conservative fixed filters (LLM-generated using paper and metadata)
  2. Sample-based adaptive filters (using per-sample quantile tables)
latch-curate qc run
Important: Inspect the report before proceeding.
Outputs:
  • Filtered object → qc/qc.h5ad
  • Report → qc/qc_report.html
  • QC parameters → qc/qc_params.yaml
Modify QC parameters:
latch-curate qc run --use-params
Uses parameters from qc/qc_params.yaml to re-run QC with custom values.
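For intuition about the second pass, here is a minimal sketch of a per-sample quantile filter. It is not the latch-curate implementation: the cell representation (a list of dicts with a "sample" key), the metric name, and the 5th/95th percentile bounds are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import quantiles


def quantile_table(cells, metric, probs=(5, 50, 95)):
    """Per-sample percentiles of `metric`.

    `cells` is a list of dicts, each with a 'sample' key (hypothetical layout).
    Each sample needs at least two cells for quantiles() to work.
    """
    by_sample = defaultdict(list)
    for cell in cells:
        by_sample[cell["sample"]].append(cell[metric])
    table = {}
    for sample, values in by_sample.items():
        # n=100 yields 99 cut points; cuts[p - 1] is the p-th percentile.
        cuts = quantiles(values, n=100, method="inclusive")
        table[sample] = {p: cuts[p - 1] for p in probs}
    return table


def adaptive_filter(cells, metric, low=5, high=95):
    """Keep cells whose metric lies within their own sample's [low, high] percentiles."""
    table = quantile_table(cells, metric, probs=(low, high))
    return [
        c for c in cells
        if table[c["sample"]][low] <= c[metric] <= table[c["sample"]][high]
    ]
```

Because the bounds are computed per sample, a sample with systematically deeper sequencing is not penalized by thresholds tuned to a shallower one, which is the motivation for making this pass adaptive.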

4. Transform

Runs a standard transformation pipeline:
  • Normalization
  • Log transformation
  • PCA
  • Batch integration
  • Neighborhood computation
  • Embedding generation
latch-curate transform run
Important: Inspect the report before proceeding.
Outputs:
  • Transformed matrix → transform/transform.h5ad
  • Report → transform/transform.html
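As an illustration of the first two pipeline stages, the sketch below scales each cell to a fixed total and applies log(1 + x). The source does not state which normalization latch-curate uses; library-size scaling to 10,000 counts is a common single-cell convention and is an assumption here, as is the list-of-lists input format.

```python
import math


def normalize_log1p(counts, target_sum=10_000.0):
    """Library-size normalize, then log1p-transform.

    `counts` is a list of per-cell lists of raw counts (cells x genes).
    Cells with zero total counts are returned as all zeros.
    """
    out = []
    for cell in counts:
        total = sum(cell)
        scale = target_sum / total if total > 0 else 0.0
        out.append([math.log1p(x * scale) for x in cell])
    return out
```

After this step every non-empty cell has the same total on the un-logged scale, so differences between cells reflect relative expression and not sequencing depth.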

5. Type Cells

Computes differential gene expression between clusters and uses an LLM with controlled vocabulary to annotate cell types.
latch-curate type-cells run
Important: Inspect the report before proceeding.
Outputs:
  • Cell-typed matrix → type_cells/type_cells.h5ad
  • Report → type_cells/type_cells.html
  • Annotations → type_cells/type_cells_metadata.yaml
Modify cell type annotations:
latch-curate type-cells run --use-metadata
Uses annotations from type_cells/type_cells_metadata.yaml to re-run cell typing with corrected values.
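Before re-running with corrected annotations, it can be worth checking that every edited label is still a term from the controlled vocabulary. The sketch below assumes the annotations have already been loaded into a plain cluster-to-label dict; the actual layout of type_cells_metadata.yaml is not documented here, and the vocabulary terms in the usage note are illustrative only.

```python
def find_invalid_labels(annotations, vocabulary):
    """Return {cluster: label} for labels that are not in the vocabulary.

    Comparison ignores case and surrounding whitespace. Both arguments are
    hypothetical in-memory structures, not the latch-curate file formats.
    """
    allowed = {term.strip().casefold() for term in vocabulary}
    return {
        cluster: label
        for cluster, label in annotations.items()
        if label.strip().casefold() not in allowed
    }
```

For example, find_invalid_labels({"0": "T cell", "1": "Tcell"}, ["T cell", "B cell"]) returns {"1": "Tcell"}, flagging the typo before it reaches the curated object.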

6. Harmonize Metadata

Uses an LLM with access to all downloaded information and ontology search tools to construct harmonized variables against controlled vocabularies.
Harmonized variables (at sample resolution):
  • latch_subject_id
  • latch_condition
  • latch_disease
  • latch_tissue
  • latch_sample_site
  • latch_sequencing_platform
  • latch_organism
latch-curate harmonize-metadata run
Important: Inspect the report for sample-to-variable mapping and reasoning.
Outputs:
  • Harmonized object → harmonize_metadata/harmonize_metadata.h5ad
  • Report → harmonize_metadata/harmonize_metadata.html
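Since the harmonized variables are defined at sample resolution, each one should take exactly one value per sample. The following is a minimal sketch of that check over per-cell rows; the row format (a list of dicts) and the "sample" key name are assumptions, and this is not the linting workflow itself.

```python
from collections import defaultdict

# Variable names as listed in this section.
HARMONIZED_VARIABLES = (
    "latch_subject_id",
    "latch_condition",
    "latch_disease",
    "latch_tissue",
    "latch_sample_site",
    "latch_sequencing_platform",
    "latch_organism",
)


def check_sample_resolution(rows, sample_key="sample"):
    """Return a list of problems; an empty list means every variable is
    present in every row and constant within each sample."""
    seen = defaultdict(set)
    problems = []
    for i, row in enumerate(rows):
        for var in HARMONIZED_VARIABLES:
            if row.get(var) in ("", None):
                problems.append(f"row {i}: missing {var}")
            else:
                seen[(row[sample_key], var)].add(row[var])
    for sample, var in sorted(seen):
        values = seen[(sample, var)]
        if len(values) > 1:
            problems.append(f"sample {sample}: {var} has {len(values)} values")
    return problems
```

With the object loaded in Python, rows could be built from the per-cell metadata table; an empty result is a quick sanity check before uploading, not a replacement for reading the report.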

7. Upload, Lint, and Convert

Final steps to complete and validate the curated object:
  1. Upload to Latch Data:
    latch cp harmonize_metadata/harmonize_metadata.h5ad latch:///path/to/destination
    
    You can also upload reports and intermediate objects as needed.
  2. Run linting tests: Use the Lint Curated AnnData Workflow to validate object structure.
  3. Convert to Seurat (optional): Use the AnnData To Seurat Conversion Workflow if Seurat format is required.

Best Practices

  • Always review reports before proceeding to the next step
  • Keep the original downloaded data intact
  • Document any manual corrections made during the process
  • Use the --use-params and --use-metadata flags to iterate on automated decisions
  • Maintain separate directories for each dataset curation
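The review-before-proceeding practice can be enforced with a small driver that pauses after each step. This is a sketch under stated assumptions: the commands and report paths are copied from the steps in this document, the function names are hypothetical, and it assumes the commands are run from inside the project directory. Step 7 is left out because it uses latch cp and Latch workflows.

```python
import subprocess
from pathlib import Path


def build_steps(gse_id):
    """(command, report path or None) pairs for steps 1 to 6."""
    return [
        (["latch-curate", "download", "run", "--gse-id", gse_id], None),
        (["latch-curate", "construct-counts", "run"],
         "construct_counts/counts.html"),
        (["latch-curate", "qc", "run"], "qc/qc_report.html"),
        (["latch-curate", "transform", "run"], "transform/transform.html"),
        (["latch-curate", "type-cells", "run"], "type_cells/type_cells.html"),
        (["latch-curate", "harmonize-metadata", "run"],
         "harmonize_metadata/harmonize_metadata.html"),
    ]


def run_with_review(gse_id, project_dir):
    """Run each step in order, stopping unless the user confirms."""
    for command, report in build_steps(gse_id):
        # check=True aborts the loop if a step exits with an error.
        subprocess.run(command, cwd=project_dir, check=True)
        if report is not None:
            print(f"Review {Path(project_dir) / report}")
        answer = input("Continue to the next step? [y/N] ")
        if answer.strip().lower() != "y":
            break
```

The pause after the download step also leaves room for the manual work of adding paper_text.txt and paper_url.txt before construct-counts starts.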

Troubleshooting

  • If the download review fails, reduce the size of text files by removing citations and detailed methods
  • For construct-counts issues, use the chat command to understand what went wrong
  • When QC seems too stringent or lenient, modify the parameters YAML and re-run with --use-params
  • For incorrect cell type annotations, edit the metadata YAML and re-run with --use-metadata