Latch Curate v0.2.0
Latch Curate is an agentic Python framework designed to streamline the curation of public single-cell data from GEO accession IDs into standardized, analysis-ready objects.
Overview
The curation lifecycle consists of 7 concrete steps, each with a dedicated CLI command. Human validation is required at the end of each step, using step-specific reports, to ensure data quality and catch errors early in the process.
Prerequisites
Before beginning:
- Create a fresh, dedicated directory for your curation project (ideally named after your dataset)
- Each dataset requires its own directory
- Ensure you have the GEO accession ID for your dataset
Curation Steps
1. Download
Download metadata and supplementary files from GEO.
Provide the paper manually:
- Copy and paste the paper text into download/paper_text.txt
  - If the paper is unavailable or behind a paywall, use the abstract only
  - If no abstract exists, copy the entire text from the GEO accession page
- Copy the paper URL to download/paper_url.txt
Resulting files:
- GSE and SRP metadata → download/study_metadata.txt
- Supplementary files → download/supp_data/
- User-provided paper text → download/paper_text.txt
- User-provided paper URL → download/paper_url.txt
Trim nonessential content from paper_text.txt and study_metadata.txt (e.g., citations, dense methods sections) to keep the files small.
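Before moving on, it can help to confirm that the files listed above are actually in place. The following is a minimal sketch, not part of Latch Curate: the helper name `missing_download_artifacts` is hypothetical, and only the relative paths come from the Download step.

```python
from pathlib import Path

# Relative paths taken from the Download step above.
EXPECTED = [
    "download/study_metadata.txt",
    "download/paper_text.txt",
    "download/paper_url.txt",
    "download/supp_data",
]


def missing_download_artifacts(project_dir):
    """Return the expected download paths that are absent or empty.

    Hypothetical helper for illustration; it is not a Latch Curate command.
    """
    root = Path(project_dir)
    missing = []
    for rel in EXPECTED:
        path = root / rel
        if path.is_dir():
            # A directory counts as present only if it holds at least one entry.
            if not any(path.iterdir()):
                missing.append(rel)
        elif not path.is_file() or path.stat().st_size == 0:
            # Missing files and zero-byte files are both reported.
            missing.append(rel)
    return missing
```

Calling `missing_download_artifacts(".")` from the project directory returns an empty list when every expected file is present and non-empty.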
2. Construct Counts
Build a counts matrix using an LLM with access to tools and a terminal. The process continues until all tests pass.
Resulting files:
- Counts matrix → construct_counts/counts.h5ad
- Report → construct_counts/counts.html
What this step does:
- Maps gene symbols to Ensembl IDs
- Creates obs['latch_sample_id'] from existing metadata
- Ensures var['gene_symbols'] exists
- Prefixes author metadata with author_
- Validates counts are raw (non-negative integers)
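Two of the checks listed above are easy to illustrate in isolation. The sketch below assumes a dense matrix and uses NumPy; the function names and the `reserved` argument are hypothetical and do not reflect how Latch Curate implements its own tests.

```python
import numpy as np


def is_raw_counts(matrix):
    """Return True when every entry is a finite, non-negative whole number."""
    values = np.asarray(matrix, dtype=float)
    return bool(
        np.all(np.isfinite(values))
        and np.all(values >= 0)
        and np.all(values == np.floor(values))
    )


def prefix_author_columns(columns, reserved=("latch_sample_id",)):
    """Prefix author-supplied column names with 'author_'.

    Names in `reserved` (an assumption for this sketch) and names that
    already carry the prefix are left unchanged.
    """
    return [
        name if name in reserved or name.startswith("author_") else f"author_{name}"
        for name in columns
    ]
```

Normalized or log-transformed values fail `is_raw_counts` because they contain fractional entries, which is the situation the raw-counts validation is meant to catch.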
3. Quality Control (QC)
Performs two-pass QC:
- Conservative fixed filters (LLM-generated using paper and metadata)
- Sample-based adaptive filters (using per-sample quantile tables)
Resulting files:
- Filtered object → qc/qc.h5ad
- Report → qc/qc_report.html
- QC parameters → qc/qc_params.yaml
Edit qc/qc_params.yaml and pass --use-params to re-run QC with custom values.
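The idea behind sample-based adaptive filtering can be sketched with the standard library alone. This is an illustration under stated assumptions, not the framework's algorithm: the function names are hypothetical, and the 5th and 95th percentile cut-offs are arbitrary choices made for the example.

```python
from collections import defaultdict
from statistics import quantiles


def per_sample_bounds(sample_ids, values, n=20):
    """Return {sample_id: (low, high)} from per-sample quantiles.

    With n=20 the first and last cut points are the 5th and 95th
    percentiles (an illustrative choice). Each sample needs at least
    two observations.
    """
    grouped = defaultdict(list)
    for sample, value in zip(sample_ids, values):
        grouped[sample].append(value)
    bounds = {}
    for sample, sample_values in grouped.items():
        cuts = quantiles(sample_values, n=n, method="inclusive")
        bounds[sample] = (cuts[0], cuts[-1])
    return bounds


def keep_mask(sample_ids, values, bounds):
    """Return one boolean per cell: True if it lies within its sample's bounds."""
    return [
        bounds[sample][0] <= value <= bounds[sample][1]
        for sample, value in zip(sample_ids, values)
    ]
```

Because the bounds are computed per sample, a sample with systematically higher or lower values for a QC metric is filtered relative to its own distribution rather than against a single global threshold.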
4. Transform
Runs the standard transformation pipeline:
- Normalization
- Log transformation
- PCA
- Batch integration
- Neighborhood computation
- Embeddings generation
Resulting files:
- Transformed matrix → transform/transform.h5ad
- Report → transform/transform.html
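The first three stages of the pipeline above can be sketched in a few lines of NumPy. This is a simplified illustration, not the code Latch Curate runs: the function name, the `target_sum` default, and the use of a plain SVD for PCA are all assumptions, and batch integration, neighborhood computation, and embeddings are left out.

```python
import numpy as np


def normalize_log_pca(counts, target_sum=1e4, n_components=2):
    """Library-size normalize, log1p-transform, and project onto principal components.

    `counts` is a dense cells-by-genes matrix. Returns the log-transformed
    matrix and the per-cell PCA scores.
    """
    counts = np.asarray(counts, dtype=float)

    # Scale each cell so its counts sum to target_sum (empty cells stay at zero).
    library_size = counts.sum(axis=1, keepdims=True)
    library_size[library_size == 0] = 1.0
    normalized = counts / library_size * target_sum

    # log(1 + x) keeps zeros at zero and compresses large values.
    logged = np.log1p(normalized)

    # PCA via SVD of the gene-centered matrix.
    centered = logged - logged.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    return logged, scores
```

The returned scores have one row per cell and one column per component, which is the kind of reduced representation that later stages such as neighborhood computation typically operate on.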
5. Type Cells
Computes differential gene expression between clusters and uses an LLM with a controlled vocabulary to annotate cell types.
Resulting files:
- Cell-typed matrix → type_cells/type_cells.h5ad
- Report → type_cells/type_cells.html
- Annotations → type_cells/type_cells_metadata.yaml
Edit type_cells/type_cells_metadata.yaml and pass --use-metadata to re-run cell typing with corrected values.
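To make "differential gene expression between clusters" concrete, the sketch below ranks genes for each cluster by the difference in mean log expression against all other cells. This is a deliberately simple stand-in, not the statistical test used by the framework, and the function name `top_markers` is hypothetical.

```python
import numpy as np


def top_markers(log_expr, clusters, gene_names, k=2):
    """Return {cluster: top-k genes} ranked by mean difference versus the rest.

    `log_expr` is a dense cells-by-genes matrix of log-transformed values.
    At least two clusters are required so every cluster has a "rest" group.
    """
    log_expr = np.asarray(log_expr, dtype=float)
    clusters = np.asarray(clusters)
    markers = {}
    for cluster in np.unique(clusters):
        in_cluster = clusters == cluster
        # Positive values mean the gene is higher inside the cluster.
        diff = log_expr[in_cluster].mean(axis=0) - log_expr[~in_cluster].mean(axis=0)
        order = np.argsort(diff)[::-1][:k]
        markers[str(cluster)] = [gene_names[i] for i in order]
    return markers
```

Marker lists like these are the kind of evidence an annotator, human or LLM, would compare against known cell type signatures when reviewing the annotations YAML.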
Run on external AnnData files:
6. Harmonize Metadata
Uses an LLM with access to all downloaded information and ontology search tools to construct harmonized variables against controlled vocabularies.
Harmonized variables (at sample resolution):
- latch_subject_id
- latch_condition
- latch_disease
- latch_tissue
- latch_sample_site
- latch_sequencing_platform
- latch_organism
Resulting files:
- Harmonized object → harmonize_metadata/harmonize_metadata.h5ad
- Report → harmonize_metadata/harmonize_metadata.html
- Annotations → harmonize_metadata/harmonize_metadata_metadata.yaml
Edit harmonize_metadata/harmonize_metadata_metadata.yaml and pass --use-metadata to re-run harmonization with corrected values.
Run on external AnnData files:
The external AnnData object must contain an obs['latch_sample_id'] column.
Requirements: The download/ folder must still exist with study_metadata.txt and paper_text.txt files.
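Because the harmonized variables are defined at sample resolution, every cell that shares a latch_sample_id should carry the same value for each of them. The sketch below checks that property on per-cell records represented as plain dictionaries. The function name is hypothetical and the check is an illustration, not one of the framework's lint tests.

```python
from collections import defaultdict

# Variable names as listed in the Harmonize Metadata step.
HARMONIZED_VARIABLES = [
    "latch_subject_id",
    "latch_condition",
    "latch_disease",
    "latch_tissue",
    "latch_sample_site",
    "latch_sequencing_platform",
    "latch_organism",
]


def inconsistent_variables(rows):
    """Return {sample_id: [variables with more than one value in that sample]}.

    `rows` is an iterable of per-cell dicts, each with a 'latch_sample_id' key.
    Samples whose harmonized variables are all constant are omitted.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for row in rows:
        sample = row["latch_sample_id"]
        for name in HARMONIZED_VARIABLES:
            seen[sample][name].add(row.get(name))
    problems = {}
    for sample, per_variable in seen.items():
        bad = sorted(name for name, values in per_variable.items() if len(values) > 1)
        if bad:
            problems[sample] = bad
    return problems
```

An empty result means each sample maps to exactly one value per harmonized variable. Any entry points to a sample worth inspecting in the report before correcting the annotations YAML.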
7. Upload, Lint, and Convert
Final steps to complete and validate the curated object:
- Upload to Latch Data: Upload the curated object. You can also upload reports and intermediate objects as needed.
- Run linting tests: Use the Lint Curated AnnData Workflow to validate object structure.
- Convert to Seurat (optional): Use the AnnData To Seurat Conversion Workflow if Seurat format is required.
Best Practices
- Always review reports before proceeding to the next step
- Keep the original downloaded data intact
- Document any manual corrections made during the process
- Use the --use-params flag (for QC) and the --use-metadata flag (for type-cells and harmonize-metadata) to iterate on automated decisions
- Maintain separate directories for each dataset curation
Troubleshooting
- If the download review fails, reduce the size of text files by removing citations and detailed methods
- For construct-counts issues, use the chat command to understand what went wrong
- When QC seems too stringent or lenient, modify the parameters YAML and re-run with --use-params
- For incorrect cell type annotations, edit the metadata YAML and re-run with --use-metadata
- For incorrect metadata harmonization, edit the annotations YAML and re-run with --use-metadata