Latch Curate v0.2.0
Latch Curate is an agentic Python framework designed to streamline the curation of public single-cell data from GEO accession IDs into standardized, analysis-ready objects.
Overview
The curation lifecycle consists of 7 concrete steps, each with a dedicated CLI command. Human validation is required at the end of each step through step-specific reports to ensure data quality and catch errors early in the process.
Prerequisites
Before beginning:
- Create a fresh, dedicated directory for your curation project (ideally named after your dataset)
- Each dataset requires its own directory
- Ensure you have the GEO accession ID for your dataset
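The prerequisites above can be sketched as a small setup helper. This is an illustration only, not part of Latch Curate: `init_project_dir` is a hypothetical name, and naming the directory after the GSE accession is an assumption (the text only says "ideally named after your dataset").

```python
import re
from pathlib import Path

# GEO series accessions are "GSE" followed by digits (e.g. GSE123456).
GSE_PATTERN = re.compile(r"GSE\d+")


def init_project_dir(parent: Path, gse_id: str) -> Path:
    """Create a fresh, dataset-specific directory (hypothetical helper)."""
    if not GSE_PATTERN.fullmatch(gse_id):
        raise ValueError(f"not a GEO series accession: {gse_id!r}")
    project = parent / gse_id
    # exist_ok=False: refuse to reuse a directory left over from another curation.
    project.mkdir(parents=True, exist_ok=False)
    return project
```

Calling `init_project_dir(Path.cwd(), "GSE123456")` would create `./GSE123456` and fail loudly if that directory already exists, which keeps one dataset per directory.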
Curation Steps
1. Download
Download metadata and supplementary files from GEO.
Manual steps:
- Copy and paste the paper text into download/paper_text.txt
- If the paper is unavailable or behind a paywall, use the abstract only
- If no abstract exists, copy the entire text from the GEO accession page
- Copy the paper URL to download/paper_url.txt
Outputs:
- GSE and SRP metadata → download/study_metadata.txt
- Supplementary files → download/supp_data/
- User-provided paper text → download/paper_text.txt
- User-provided paper URL → download/paper_url.txt
If the download review fails, reduce the size of paper_text.txt and study_metadata.txt by removing nonessential content (e.g., citations, dense methods sections).
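Before reviewing this step, it can help to confirm that the files listed above are in place and to see how large the two text files are. The sketch below is a standalone check, not a Latch Curate command; the function names are hypothetical and the paths are taken from the list above.

```python
from pathlib import Path

# Paths from the Download step, relative to the project directory.
EXPECTED_FILES = [
    "download/paper_text.txt",
    "download/paper_url.txt",
    "download/study_metadata.txt",
]
EXPECTED_DIRS = ["download/supp_data"]


def missing_download_outputs(project: Path) -> list[str]:
    """Return expected paths that are absent (or, for files, empty)."""
    missing = []
    for rel in EXPECTED_FILES:
        path = project / rel
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    for rel in EXPECTED_DIRS:
        if not (project / rel).is_dir():
            missing.append(rel)
    return missing


def text_sizes(project: Path) -> dict[str, int]:
    """Character counts of the two text files that may need trimming."""
    names = ("download/paper_text.txt", "download/study_metadata.txt")
    return {rel: len((project / rel).read_text(encoding="utf-8")) for rel in names}
```

An unusually large count from `text_sizes` is a hint that citations or dense methods sections could be trimmed; what counts as too large is not specified in this document.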
2. Construct Counts
Build a counts matrix using an LLM with access to tools and a terminal. The process continues until all tests pass.
Outputs:
- Counts matrix → construct_counts/counts.h5ad
- Report → construct_counts/counts.html
3. Quality Control (QC)
Performs two-pass QC:
- Conservative fixed filters (LLM-generated using paper and metadata)
- Sample-based adaptive filters (using per-sample quantile tables)
Outputs:
- Filtered object → qc/qc.h5ad
- Report → qc/qc_report.html
- QC parameters → qc/qc_params.yaml
Edit qc/qc_params.yaml and re-run with --use-params to re-run QC with custom values.
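The two-pass idea can be illustrated with a toy example in plain Python. This is a sketch under stated assumptions, not Latch Curate's implementation: the metric names (`n_genes`, `pct_mito`), the threshold values, and the choice of the 95th percentile are all hypothetical, and the real filters and the layout of qc_params.yaml may differ.

```python
from collections import defaultdict
from statistics import quantiles


def fixed_filter(cells, min_genes=200, max_pct_mito=20.0):
    """Pass 1: conservative thresholds applied identically to every sample."""
    return [
        c for c in cells
        if c["n_genes"] >= min_genes and c["pct_mito"] <= max_pct_mito
    ]


def adaptive_filter(cells, upper_pct=95):
    """Pass 2: drop cells above their own sample's n_genes percentile."""
    by_sample = defaultdict(list)
    for c in cells:
        by_sample[c["sample"]].append(c["n_genes"])

    thresholds = {}
    for sample, values in by_sample.items():
        if len(values) < 2:
            continue  # too few cells to estimate a quantile; keep them all
        # quantiles(n=100) returns 99 cut points; index p-1 is the p-th percentile.
        cuts = quantiles(values, n=100, method="inclusive")
        thresholds[sample] = cuts[upper_pct - 1]

    return [
        c for c in cells
        if c["n_genes"] <= thresholds.get(c["sample"], float("inf"))
    ]
```

Running the fixed pass first means obviously bad cells do not distort the per-sample quantiles used by the second pass.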
4. Transform
Runs a standard transformation pipeline:
- Normalization
- Log transformation
- PCA
- Batch integration
- Neighborhood computation
- Embeddings generation
Outputs:
- Transformed matrix → transform/transform.h5ad
- Report → transform/transform.html
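The first two stages of the pipeline above (normalization and log transformation) can be sketched in plain Python. This assumes total-count scaling to 10,000 counts per cell followed by log(1 + x), a common convention in single-cell analysis; the document does not state which normalization Latch Curate applies, and `normalize_log1p` is a hypothetical name.

```python
import math


def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to target_sum total counts, then apply log(1 + x)."""
    transformed = []
    for row in counts:
        total = sum(row)
        # Cells with zero total counts stay all-zero instead of dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        transformed.append([math.log1p(value * scale) for value in row])
    return transformed
```

Scaling before the log makes cells with different sequencing depths comparable, and the log compresses the dynamic range before PCA.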
5. Type Cells
Computes differential gene expression between clusters and uses an LLM with a controlled vocabulary to annotate cell types.
Outputs:
- Cell typed matrix → type_cells/type_cells.h5ad
- Report → type_cells/type_cells.html
- Annotations → type_cells/type_cells_metadata.yaml
Edit type_cells/type_cells_metadata.yaml and re-run with --use-metadata to re-run cell typing with corrected values.
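To make "differential gene expression between clusters" concrete, the toy sketch below ranks genes for each cluster by the difference between the mean expression inside the cluster and the mean in all other cells. It is a deliberately simplified stand-in: it performs no statistical test, `top_marker_genes` is a hypothetical name, and the method Latch Curate actually uses is not described in this document.

```python
from collections import defaultdict


def top_marker_genes(expr, clusters, genes, n_top=2):
    """Rank genes per cluster by mean(in cluster) - mean(all other cells).

    expr: per-cell lists of log-normalized expression, one value per gene.
    clusters: cluster label for each cell, in the same order as expr.
    """
    rows_by_cluster = defaultdict(list)
    for row, label in zip(expr, clusters):
        rows_by_cluster[label].append(row)

    markers = {}
    for label, rows in rows_by_cluster.items():
        rest = [row for row, other in zip(expr, clusters) if other != label]
        scores = []
        for g, gene in enumerate(genes):
            mean_in = sum(row[g] for row in rows) / len(rows)
            mean_out = sum(row[g] for row in rest) / len(rest) if rest else 0.0
            scores.append((mean_in - mean_out, gene))
        scores.sort(reverse=True)  # largest difference first
        markers[label] = [gene for _, gene in scores[:n_top]]
    return markers
```

Per-cluster gene lists of this kind are the sort of evidence an annotator (human or LLM) can compare against known cell type markers.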
6. Harmonize Metadata
Uses an LLM with access to all downloaded information and ontology search tools to construct harmonized variables against controlled vocabularies.
Harmonized variables (at sample resolution):
- latch_subject_id
- latch_condition
- latch_disease
- latch_tissue
- latch_sample_site
- latch_sequencing_platform
- latch_organism
Outputs:
- Harmonized object → harmonize_metadata/harmonize_metadata.h5ad
- Report → harmonize_metadata/harmonize_metadata.html
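One way to read "at sample resolution" is that every cell from the same sample carries the same value for each harmonized variable. The sketch below checks that property on per-cell metadata rows represented as dictionaries. It is an illustrative check only: the `sample_id` key and the function name are hypothetical, and it is not the Lint Curated AnnData Workflow.

```python
HARMONIZED_VARS = [
    "latch_subject_id",
    "latch_condition",
    "latch_disease",
    "latch_tissue",
    "latch_sample_site",
    "latch_sequencing_platform",
    "latch_organism",
]


def check_sample_resolution(obs_rows, sample_key="sample_id"):
    """Report missing variables and variables that vary within one sample."""
    problems = set()
    first_seen = {}  # (sample, variable) -> first value observed
    for row in obs_rows:
        sample = row[sample_key]
        for var in HARMONIZED_VARS:
            if var not in row:
                problems.add(f"{sample}: missing {var}")
            elif first_seen.setdefault((sample, var), row[var]) != row[var]:
                problems.add(f"{sample}: conflicting values for {var}")
    return sorted(problems)
```

An empty list means each sample maps to exactly one value per variable; any entry points at a sample whose metadata needs manual review.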
7. Upload, Lint, and Convert
Final steps to complete and validate the curated object:
- Upload to Latch Data: Upload the curated object. You can also upload reports and intermediate objects as needed.
- Run linting tests: Use the Lint Curated AnnData Workflow to validate object structure.
- Convert to Seurat (optional): Use the AnnData To Seurat Conversion Workflow if Seurat format is required.
Best Practices
- Always review reports before proceeding to the next step
- Keep the original downloaded data intact
- Document any manual corrections made during the process
- Use the --use-params and --use-metadata flags to iterate on automated decisions
- Maintain separate directories for each dataset curation
Troubleshooting
- If the download review fails, reduce the size of text files by removing citations and detailed methods
- For construct-counts issues, use the chat command to understand what went wrong
- When QC seems too stringent or lenient, modify the parameters YAML and re-run with --use-params
- For incorrect cell type annotations, edit the metadata YAML and re-run with --use-metadata