Latch Curate

Progress in engineering biology increasingly depends on data-hungry statistical models to reason about emergent properties that outstrip unaided human cognition. While purpose-built industrial data-generation efforts such as perturbation atlases offer a path forward, they do not yet sample a sufficiently broad observational space, especially for rare indications. Aggregated public scRNA-seq datasets form the world’s largest and most diverse repository of diseases, tissues, and patients, yet remain underutilized because manual structuring and annotation are costly.

What is Latch Curate?

Latch Curate is a human-in-the-loop agentic framework that guides an expert scientist through an ordered, step-by-step curation lifecycle, helping them perform tasks such as count matrix construction, cell typing, and metadata harmonization with greater efficiency and accuracy.

What is Data Curation?

In single-cell bioinformatics, curation describes the structuring of raw research data into well-defined count objects with controlled annotations fit for industrial use. This enables the re-use of existing experimental data in far less time and at far lower cost than de novo generation. The process involves:
  • Quality Control: Filtering out low-quality data and artifacts
  • Standardization: Converting data into consistent formats and structures
  • Annotation: Adding biological context and cell type information
  • Harmonization: Aligning metadata to controlled vocabularies and ontologies
  • Validation: Ensuring data meets quality standards and specifications
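The quality-control step above can be sketched in a few lines. This is a minimal illustration, not Latch Curate's implementation: the cell records and thresholds are hypothetical, and real cutoffs are chosen per dataset during curator review.

```python
# Minimal sketch of the QC step: drop cells with too few detected genes
# or a high mitochondrial fraction. Thresholds are illustrative only;
# real cutoffs are set per dataset with curator sign-off.

def qc_filter(cells, min_genes=200, max_mito_frac=0.2):
    """Keep cells passing basic quality thresholds.

    `cells` is a list of dicts with 'n_genes' (detected genes) and
    'mito_frac' (fraction of counts from mitochondrial genes).
    """
    return [
        c for c in cells
        if c["n_genes"] >= min_genes and c["mito_frac"] <= max_mito_frac
    ]

cells = [
    {"id": "AAACCTG", "n_genes": 1500, "mito_frac": 0.05},  # passes QC
    {"id": "AAACGGG", "n_genes": 90,   "mito_frac": 0.04},  # too few genes
    {"id": "AAAGATG", "n_genes": 2100, "mito_frac": 0.45},  # likely dying
]
passed = qc_filter(cells)
print([c["id"] for c in passed])  # → ['AAACCTG']
```

In practice this filtering runs over an AnnData object with library-size and mitochondrial metrics computed from the count matrix, but the decision logic is the same.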

Available Tools

Latch Curate CLI

An agentic Python framework that automates the curation of public single-cell data from GEO repositories. The framework uses large language models combined with bioinformatics tools to:
  • Download and process GEO datasets
  • Construct count matrices
  • Perform quality control
  • Transform and normalize data
  • Annotate cell types
  • Harmonize metadata against controlled vocabularies
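As an illustration of the count-matrix construction step, the sketch below assembles a genes × cells matrix from the barcodes, features, and sparse triplets found in 10x-style Matrix Market (.mtx) supplementary files. File parsing is elided, the in-memory triplets stand in for parsed .mtx lines, and this is not the CLI's actual code path.

```python
# Sketch of count-matrix construction from 10x-style supplementary files:
# a barcodes list, a features list, and sparse (row, col, value) triplets
# as exported in Matrix Market (.mtx) format.

def build_matrix(barcodes, features, triplets):
    """Return a dense genes x cells count matrix from 1-based triplets."""
    matrix = [[0] * len(barcodes) for _ in features]
    for gene_idx, cell_idx, count in triplets:
        matrix[gene_idx - 1][cell_idx - 1] = count  # .mtx is 1-indexed
    return matrix

barcodes = ["AAACCTG-1", "AAACGGG-1"]
features = ["ENSG00000121410", "ENSG00000171428"]
triplets = [(1, 1, 5), (2, 1, 3), (1, 2, 7)]

counts = build_matrix(barcodes, features, triplets)
print(counts)  # → [[5, 7], [3, 0]]
```

Production code would use a sparse representation (e.g. scipy CSR inside an AnnData object) rather than dense lists, since real matrices are overwhelmingly zero.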

Why Public Data Curation Matters

Public-data curation fills an unmet need in single-cell data aggregation. Emerging purpose-built projects are beginning to address the limitations of public resources (technology heterogeneity, batch effects, quality variation, and sparse perturbational sampling), but they still cover only a fraction of the biological landscape and will take time to reach full breadth. Aggregated public datasets therefore remain the largest and most diverse reservoir of diseases, tissues, and patients. For indications with small patient populations, or for complex diseases demanding fine-grained stratification, statistical models must draw on these niche biological states to achieve translational utility.

Despite their value, curated public datasets remain underutilized because curation demands expensive human labor. Curators must blend PhD-level biological reasoning, single-cell analysis expertise, and data-engineering skills: they devote substantial effort to writing custom code that manipulates unstructured supplementary files and to combing through study metadata and primary papers for precise annotations.

The Latch Curate Approach

Human-in-the-Loop Efficiency

Curation throughput scales when a curator runs many agentic workflows in parallel. A single task, such as count-matrix construction, may run unattended for 5–30 minutes before it needs human validation, so throughput peaks when enough concurrent runs keep the validation queue full.
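The scaling claim can be made concrete with a toy utilization model. The timings and the function below are hypothetical, chosen only to show the shape of the curve, not measured Latch Curate behavior.

```python
# Toy model of curator utilization: each run needs `agent_minutes` of
# unattended agent work followed by `review_minutes` of human validation.
# With n_runs in flight, validation requests arrive often enough that the
# curator's busy fraction grows linearly until it saturates at 1.0.

def curator_utilization(n_runs, agent_minutes=20, review_minutes=5):
    """Fraction of curator time spent validating, capped at 1.0."""
    demand = n_runs * review_minutes / (agent_minutes + review_minutes)
    return min(demand, 1.0)

for n in (1, 3, 5, 10):
    print(n, round(curator_utilization(n), 2))
```

With these illustrative numbers, one run keeps the curator busy only 20% of the time, while five concurrent runs saturate the validation queue.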

Automation with Human Oversight

  • Automates repetitive curation tasks while maintaining human validation checkpoints
  • Generates detailed reports at each step for quality assurance
  • Allows iterative refinement of automated decisions
  • Presents artifacts with plots and chain-of-thought reasoning for curator review

Standardization

  • Ensures consistent processing across datasets
  • Harmonizes metadata to standard ontologies (MONDO, CL, UBERON, EFO)
  • Produces interoperable data formats (AnnData, Seurat)
  • Adopts the Scanpy ecosystem and AnnData objects as storage standard
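The harmonization bullet above amounts to mapping free-text study labels onto controlled ontology terms. The sketch below shows the idea with a tiny illustrative lookup table; real curation resolves labels against the full MONDO, CL, and UBERON ontologies, with curator review of ambiguous matches.

```python
# Sketch of metadata harmonization: free-text labels from a study are
# normalized and mapped to controlled ontology identifiers. The lookup
# table is an illustrative stand-in for full ontology resolution.

ONTOLOGY_MAP = {
    "t cell": ("CL:0000084", "T cell"),
    "t-cells": ("CL:0000084", "T cell"),
    "lung": ("UBERON:0002048", "lung"),
}

def harmonize(label):
    """Return (ontology_id, canonical_name), or None for unmapped labels."""
    return ONTOLOGY_MAP.get(label.strip().lower())

print(harmonize("T-cells"))  # → ('CL:0000084', 'T cell')
print(harmonize(" Lung "))   # → ('UBERON:0002048', 'lung')
```

Unmapped labels returning `None` is the interesting case: those are exactly the annotations that get routed to the human curator for resolution.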

Reproducibility

  • Documents all processing steps
  • Maintains parameter files for reproducible analysis
  • Tracks provenance throughout the curation pipeline
  • Outputs assets (driver scripts, JSON files, agent logs, reports) into version-controlled directories
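The reproducibility pattern above can be sketched as follows. The directory layout, file names, and field names are illustrative, not Latch Curate's actual on-disk schema.

```python
# Sketch of per-step provenance tracking: each pipeline step writes its
# parameters to a JSON file and appends an entry to an append-only
# provenance log inside a version-controlled run directory.

import json
import time
from pathlib import Path

def record_step(outdir, step, params):
    """Write a params file and append a provenance entry for one step."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    (outdir / f"{step}_params.json").write_text(json.dumps(params, indent=2))
    entry = {
        "step": step,
        "params": params,
        "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    with (outdir / "provenance.jsonl").open("a") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry

# GSE000000 is a placeholder accession for illustration.
record_step("curation_run/GSE000000", "quality_control",
            {"min_genes": 200, "max_mito_frac": 0.2})
```

Because every step leaves behind its exact parameters and a timestamped log line, a reviewer can re-run any stage of the pipeline with identical settings.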

Integration

  • Seamlessly integrates with Latch Data for storage
  • Works with existing Latch workflows for downstream analysis
  • Supports standard single-cell analysis formats
  • Deployed on the LatchBio platform and used by internal biotech teams and third-party solution providers

Use Cases

  • Public Data Integration: Curate GEO datasets for meta-analysis
  • Data Harmonization: Standardize internal datasets to common formats
  • Atlas Building: Prepare datasets for integration into cell atlases
  • Quality Assurance: Validate and clean experimental data before analysis

Getting Help

  • Review the Getting Started Guide for detailed instructions
  • Check individual step documentation for specific parameters and options
  • Contact support for assistance with complex curation projects