Progress in engineering biology increasingly depends on data-hungry statistical models to reason about emergent properties that outstrip unaided human cognition. While purpose-built industrial data-generation efforts such as perturbation atlases offer a path forward, they do not yet sample a sufficiently broad observational space, especially for rare indications. Aggregated public scRNA-seq datasets form the world’s largest and most diverse repository of diseases, tissues, and patients, yet remain underutilized because manual structuring and annotation are costly.
Latch Curate is a human-in-the-loop agentic framework that guides an expert scientist through an ordered, step-by-step curation lifecycle and helps them perform tasks such as count matrix construction, cell typing, and metadata harmonization with greater efficiency and accuracy.
In single-cell bioinformatics, curation describes the structuring of raw research data into well-defined count objects with controlled annotations fit for industrial use. This enables the reuse of existing experimental data with far less time and fewer resources than de novo generation. The process involves the following steps (a short harmonization sketch follows the list):
Quality Control: Filtering out low-quality data and artifacts
Standardization: Converting data into consistent formats and structures
Annotation: Adding biological context and cell type information
Harmonization: Aligning metadata to controlled vocabularies and ontologies
Validation: Ensuring data meets quality standards and specifications
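To make the harmonization and validation steps concrete, here is a minimal Python sketch that maps free-text tissue labels to controlled UBERON terms and rejects anything unmapped. The mapping table and function names are illustrative assumptions, not part of Latch Curate.

```python
# Minimal harmonization/validation sketch. The mapping table below is an
# illustrative assumption; a real curation run would target a complete
# controlled vocabulary such as the UBERON ontology.
RAW_TO_UBERON = {
    "pbmc": ("blood", "UBERON:0000178"),
    "peripheral blood": ("blood", "UBERON:0000178"),
    "lung tissue": ("lung", "UBERON:0002048"),
}

def harmonize_tissue(raw_label: str) -> dict:
    """Map a free-text tissue label to a controlled term, or fail validation."""
    key = raw_label.strip().lower()
    if key not in RAW_TO_UBERON:
        # Validation: unmapped labels are surfaced for human review
        # rather than silently passed through.
        raise ValueError(f"Unmapped tissue label: {raw_label!r}")
    name, term_id = RAW_TO_UBERON[key]
    return {"tissue": name, "tissue_ontology_term_id": term_id}

print(harmonize_tissue("PBMC"))
# {'tissue': 'blood', 'tissue_ontology_term_id': 'UBERON:0000178'}
```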
Latch Curate is an agentic Python framework that automates the curation of public single-cell data from GEO repositories. The framework combines large language models with bioinformatics tools to perform the steps below (sketched in code after the list):
Download and process GEO datasets
Construct count matrices
Perform quality control
Transform and normalize data
Annotate cell types
Harmonize metadata against controlled vocabularies
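A minimal sketch of how these stages might compose, assuming GEOparse and Scanpy as the underlying bioinformatics tools; the function names and the accession are hypothetical placeholders, not the framework's actual API.

```python
# Hypothetical composition of the pipeline stages; assumes GEOparse and
# Scanpy are installed. Cell typing and metadata harmonization would
# follow as further stages.
import GEOparse
import scanpy as sc
from anndata import AnnData

def download_geo(accession: str, workdir: str = "./data"):
    """Fetch GEO series metadata; supplementary files hold the raw counts."""
    return GEOparse.get_GEO(geo=accession, destdir=workdir)

def construct_counts(matrix_dir: str) -> AnnData:
    """Build an AnnData count object from a 10x-style matrix directory."""
    return sc.read_10x_mtx(matrix_dir)

def quality_control(adata: AnnData) -> AnnData:
    sc.pp.filter_cells(adata, min_genes=200)  # drop near-empty barcodes
    sc.pp.filter_genes(adata, min_cells=3)    # drop rarely detected genes
    return adata

def normalize(adata: AnnData) -> AnnData:
    adata.layers["counts"] = adata.X.copy()   # preserve raw counts
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    return adata

gse = download_geo("GSE12345")  # hypothetical accession
adata = normalize(quality_control(construct_counts("./data/GSE12345/matrix")))
```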
Public-data curation fills an unmet need in single-cell data aggregation. While emerging purpose-built projects are beginning to alleviate the limitations of public resources - technology heterogeneity, batch effects, quality variation, and sparse perturbational sampling - they still cover only a fraction of the biological landscape and will require time to reach full breadth. Nonetheless, aggregated public datasets remain the largest and most diverse reservoir of diseases, tissues, and patients. For indications with small patient populations, or for complex diseases demanding fine-grained stratification, statistical models must draw on these niche biological states to achieve translational utility.

Despite the value of curated public datasets, these resources remain underutilized because of the expensive human labor required for curation. Curators must blend PhD-level biological reasoning, single-cell analysis expertise, and data-engineering skills. They devote substantial effort to writing custom code that manipulates unstructured supplementary files, and they comb through study metadata and primary papers for precise annotations.
Human-in-the-loop efficiency scales when curators can supervise many agentic workflows simultaneously. A single task, such as count-matrix construction, may run for 5–30 minutes before it needs human validation, so throughput peaks when enough concurrent runs keep the validation queue full.
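A toy asyncio model of that scheduling dynamic; the sleep durations (scaled from minutes to seconds) and dataset names are hypothetical.

```python
# Toy model: several agentic runs work autonomously, then hand off to a
# single human validator. Durations are scaled from minutes to seconds.
import asyncio
import random

async def agentic_run(dataset: str, queue: asyncio.Queue) -> None:
    # Stand-in for 5-30 minutes of autonomous work (e.g. count-matrix
    # construction) before the result needs human eyes.
    await asyncio.sleep(random.uniform(0.5, 3.0))
    await queue.put(dataset)

async def human_validator(queue: asyncio.Queue, n_expected: int) -> None:
    for _ in range(n_expected):
        dataset = await queue.get()
        print(f"validating {dataset}")  # the curator reviews each handoff

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    datasets = [f"dataset-{i}" for i in range(8)]  # hypothetical workload
    await asyncio.gather(
        human_validator(queue, len(datasets)),
        *(agentic_run(d, queue) for d in datasets),
    )

asyncio.run(main())
```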