Progress in engineering biology increasingly depends on data-hungry statistical models to reason about emergent properties that outstrip unaided human cognition. While purpose-built industrial data-generation efforts such as perturbation atlases offer a path forward, they do not yet sample a sufficiently broad observational space, especially for rare indications. Aggregated public scRNA-seq datasets form the world’s largest and most diverse repository of diseases, tissues, and patients, yet remain underutilized because manual structuring and annotation are costly.
Latch Curate is a human-in-the-loop agentic framework that guides an expert scientist through an ordered, step-by-step curation lifecycle and helps them perform tasks such as count matrix construction, cell typing, and metadata harmonization with greater efficiency and accuracy.
In single-cell bioinformatics, curation describes the structuring of raw research data into well-defined count objects with controlled annotations fit for industrial use. This enables the reuse of existing experimental data with far less time and fewer resources than de novo generation. The process involves the following steps (a short harmonization sketch follows the list):
Quality Control: Filtering out low-quality data and artifacts
Standardization: Converting data into consistent formats and structures
Annotation: Adding biological context and cell type information
Harmonization: Aligning metadata to controlled vocabularies and ontologies
Validation: Ensuring data meets quality standards and specifications
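To make the harmonization and validation steps concrete, here is a minimal Python sketch that maps free-text tissue labels to controlled UBERON terms and rejects anything unmapped. The mapping table and function names are illustrative assumptions, not part of Latch Curate.

```python
# Minimal harmonization/validation sketch. The mapping table below is an
# illustrative assumption; a real curation run would target a complete
# controlled vocabulary such as the UBERON ontology.
RAW_TO_UBERON = {
    "pbmc": ("blood", "UBERON:0000178"),
    "peripheral blood": ("blood", "UBERON:0000178"),
    "lung tissue": ("lung", "UBERON:0002048"),
}

def harmonize_tissue(raw_label: str) -> dict:
    """Map a free-text tissue label to a controlled term, or fail validation."""
    key = raw_label.strip().lower()
    if key not in RAW_TO_UBERON:
        # Validation: unmapped labels are surfaced for human review
        # rather than silently passed through.
        raise ValueError(f"Unmapped tissue label: {raw_label!r}")
    name, term_id = RAW_TO_UBERON[key]
    return {"tissue": name, "tissue_ontology_term_id": term_id}

print(harmonize_tissue("PBMC"))
# {'tissue': 'blood', 'tissue_ontology_term_id': 'UBERON:0000178'}
```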
Latch Curate is an agentic Python framework that automates the curation of public single-cell data from GEO repositories. The framework combines large language models with bioinformatics tools to perform the steps below (sketched in code after the list):
Download and process GEO datasets
Construct count matrices
Perform quality control
Transform and normalize data
Annotate cell types
Harmonize metadata against controlled vocabularies
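A minimal sketch of how these stages might compose, assuming GEOparse and Scanpy as the underlying bioinformatics tools; the function names and the accession are hypothetical placeholders, not the framework's actual API.

```python
# Hypothetical composition of the pipeline stages; assumes GEOparse and
# Scanpy are installed. Cell typing and metadata harmonization would
# follow as further stages.
import GEOparse
import scanpy as sc
from anndata import AnnData

def download_geo(accession: str, workdir: str = "./data"):
    """Fetch GEO series metadata; supplementary files hold the raw counts."""
    return GEOparse.get_GEO(geo=accession, destdir=workdir)

def construct_counts(matrix_dir: str) -> AnnData:
    """Build an AnnData count object from a 10x-style matrix directory."""
    return sc.read_10x_mtx(matrix_dir)

def quality_control(adata: AnnData) -> AnnData:
    sc.pp.filter_cells(adata, min_genes=200)  # drop near-empty barcodes
    sc.pp.filter_genes(adata, min_cells=3)    # drop rarely detected genes
    return adata

def normalize(adata: AnnData) -> AnnData:
    adata.layers["counts"] = adata.X.copy()   # preserve raw counts
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    return adata

gse = download_geo("GSE12345")  # hypothetical accession
adata = normalize(quality_control(construct_counts("./data/GSE12345/matrix")))
```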
Public-data curation fills an unmet need in single-cell data aggregation. While emerging purpose-built projects are beginning to alleviate the limitations of public resources - technology heterogeneity, batch effects, quality variation, and sparse perturbational sampling - they still cover only a fraction of the biological landscape and will require time to reach full breadth. Nonetheless, aggregated public datasets remain the largest and most diverse reservoir of diseases, tissues, and patients. For indications with small patient populations, or for complex diseases demanding fine-grained stratification, statistical models must draw on these niche biological states to achieve translational utility.

Despite the value of curated public datasets, these resources remain underutilized because of the expensive human labor required for curation. Curators must blend PhD-level biological reasoning, single-cell analysis expertise, and data-engineering skills. They devote substantial effort to writing custom code that manipulates unstructured supplementary files, and they comb through study metadata and primary papers for precise annotations.
Human-in-the-loop efficiency scales when curators can supervise many agentic workflows simultaneously. A single task, such as count-matrix construction, may run for 5–30 minutes before it needs human validation, so throughput peaks when enough concurrent runs keep the validation queue full.
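A toy asyncio model of that scheduling dynamic; the sleep durations (scaled from minutes to seconds) and dataset names are hypothetical.

```python
# Toy model: several agentic runs work autonomously, then hand off to a
# single human validator. Durations are scaled from minutes to seconds.
import asyncio
import random

async def agentic_run(dataset: str, queue: asyncio.Queue) -> None:
    # Stand-in for 5-30 minutes of autonomous work (e.g. count-matrix
    # construction) before the result needs human eyes.
    await asyncio.sleep(random.uniform(0.5, 3.0))
    await queue.put(dataset)

async def human_validator(queue: asyncio.Queue, n_expected: int) -> None:
    for _ in range(n_expected):
        dataset = await queue.get()
        print(f"validating {dataset}")  # the curator reviews each handoff

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    datasets = [f"dataset-{i}" for i in range(8)]  # hypothetical workload
    await asyncio.gather(
        human_validator(queue, len(datasets)),
        *(agentic_run(d, queue) for d in datasets),
    )

asyncio.run(main())
```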