Latch Curate v0.2.0
Latch Curate is an agentic Python framework designed to streamline the curation of public single cell data from GEO accession IDs into standardized, analysis-ready objects.
Overview
The curation lifecycle consists of 7 concrete steps, each with a dedicated CLI command. Human validation is required at the end of each step through step-specific reports to ensure data quality and catch errors early in the process.
Prerequisites
Before beginning:
- Create a fresh, dedicated directory for your curation project (ideally named after your dataset)
- Each dataset requires its own directory
- Ensure you have the GEO accession ID for your dataset
Curation Steps
1. Download
Download metadata and supplementary files from GEO.
Manual steps:
- Copy and paste the paper text into download/paper_text.txt
  - If the paper is unavailable or behind a paywall, use the abstract only
  - If no abstract exists, copy the entire text from the GEO accession page
- Copy the paper URL to download/paper_url.txt
Outputs:
- GSE and SRP metadata → download/study_metadata.txt
- Supplementary files → download/supp_data/
- User-provided paper text → download/paper_text.txt
- User-provided paper URL → download/paper_url.txt
If the download review fails, remove unnecessary content from paper_text.txt and study_metadata.txt (e.g., citations, dense methods sections).
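The two user-provided files can be staged with a few lines of Python. This is a minimal sketch and not part of Latch Curate: the helper name stage_download_inputs and its arguments are hypothetical, and only the download/paper_text.txt and download/paper_url.txt paths come from this guide.

```python
from pathlib import Path


def stage_download_inputs(project_dir, paper_text, paper_url):
    """Write the user-provided paper text and URL into <project_dir>/download/.

    Hypothetical helper: only the two file paths are taken from this guide.
    """
    download_dir = Path(project_dir) / "download"
    download_dir.mkdir(parents=True, exist_ok=True)

    text_path = download_dir / "paper_text.txt"
    url_path = download_dir / "paper_url.txt"
    # Trim surrounding whitespace and end each file with a single newline.
    text_path.write_text(paper_text.strip() + "\n", encoding="utf-8")
    url_path.write_text(paper_url.strip() + "\n", encoding="utf-8")
    return text_path, url_path
```

Calling it from the dataset's dedicated directory, for example stage_download_inputs(".", abstract_text, url), leaves both files where the Download step expects them.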
2. Construct Counts
Build a counts matrix using an LLM with access to tools and terminal. The process continues until all tests pass.
Outputs:
- Counts matrix → construct_counts/counts.h5ad
- Report → construct_counts/counts.html
This step:
- Maps gene symbols to Ensembl IDs
- Creates obs['latch_sample_id'] from existing metadata
- Ensures var['gene_symbols'] exists
- Prefixes author metadata with author_
- Validates counts are raw (non-negative integers)
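When reviewing the counts report, it can help to re-check some of these conventions by hand. The sketch below is an illustration only and does not reproduce the tests Latch Curate runs: the function names are hypothetical, and the only facts taken from this guide are the two required column names and the definition of raw counts as non-negative integers.

```python
import math

REQUIRED_OBS = ("latch_sample_id",)  # column name from this guide
REQUIRED_VAR = ("gene_symbols",)     # column name from this guide


def is_raw_counts(values):
    """Return True if every value is a finite, non-negative integer (3 or 3.0)."""
    for v in values:
        v = float(v)
        if not math.isfinite(v) or v < 0 or v != math.floor(v):
            return False
    return True


def missing_fields(obs_columns, var_columns):
    """List the required obs/var columns that are absent."""
    missing = [f"obs['{c}']" for c in REQUIRED_OBS if c not in obs_columns]
    missing += [f"var['{c}']" for c in REQUIRED_VAR if c not in var_columns]
    return missing
```

With an AnnData object loaded from construct_counts/counts.h5ad, the inputs would be the matrix values and the obs and var column names.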
3. Quality Control (QC)
Performs two-pass QC:
- Conservative fixed filters (LLM-generated using paper and metadata)
- Sample-based adaptive filters (using per-sample quantile tables)
Outputs:
- Filtered object → qc/qc.h5ad
- Report → qc/qc_report.html
- QC parameters → qc/qc_params.yaml
Edit qc/qc_params.yaml to re-run QC with custom values.
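To make the idea of a per-sample quantile table concrete, the sketch below groups one QC metric by sample and computes a few quantiles for each group. It is an assumption-laden illustration: this guide does not specify which metrics, quantiles, or interpolation Latch Curate uses, so the function names, the default quantiles, and the linear interpolation are all hypothetical choices.

```python
from collections import defaultdict


def quantile(sorted_vals, q):
    """Linear-interpolation quantile of a pre-sorted, non-empty list (0 <= q <= 1)."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def per_sample_quantiles(sample_ids, metric, qs=(0.05, 0.5, 0.95)):
    """Return {sample_id: {q: value}} for one per-cell QC metric.

    sample_ids and metric are parallel sequences with one entry per cell.
    """
    groups = defaultdict(list)
    for sample, value in zip(sample_ids, metric):
        groups[sample].append(value)
    return {
        sample: {q: quantile(sorted(values), q) for q in qs}
        for sample, values in groups.items()
    }
```

A table like this shows why a single fixed cutoff can be too strict for one sample and too lenient for another, which is the motivation for the second, adaptive pass.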
4. Transform
Runs a standard transformation pipeline:
- Normalization
- Log transformation
- PCA
- Batch integration
- Neighborhood computation
- Embeddings generation
Outputs:
- Transformed matrix → transform/transform.h5ad
- Report → transform/transform.html
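The first two stages of the pipeline can be illustrated in plain Python. This guide does not say which normalization Latch Curate applies, so the sketch assumes one common choice, scaling each cell to a fixed total before taking log(1 + x). The function name and the target_sum parameter are hypothetical.

```python
import math


def normalize_log1p(counts, target_sum=10_000.0):
    """Scale each cell's counts to target_sum, then apply log(1 + x).

    counts: list of per-cell lists of raw counts (cells x genes).
    Cells with zero total counts are returned as all zeros.
    """
    out = []
    for cell in counts:
        total = sum(cell)
        scale = target_sum / total if total > 0 else 0.0
        out.append([math.log1p(c * scale) for c in cell])
    return out
```

The later stages (PCA, batch integration, neighborhood computation, embeddings) operate on the output of this kind of transformation and are not sketched here.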
5. Type Cells
Computes differential gene expression between clusters and uses an LLM with controlled vocabulary to annotate cell types.
Outputs:
- Cell typed matrix → type_cells/type_cells.h5ad
- Report → type_cells/type_cells.html
- Annotations → type_cells/type_cells_metadata.yaml
Edit type_cells/type_cells_metadata.yaml to re-run cell typing with corrected values.
This step can also be run on external AnnData files.
6. Harmonize Metadata
Uses an LLM with access to all downloaded information and ontology search tools to construct harmonized variables against controlled vocabularies.
Harmonized variables (at sample resolution):
- latch_subject_id
- latch_condition
- latch_disease
- latch_tissue
- latch_sample_site
- latch_sequencing_platform
- latch_organism
Outputs:
- Harmonized object → harmonize_metadata/harmonize_metadata.h5ad
- Report → harmonize_metadata/harmonize_metadata.html
- Annotations → harmonize_metadata/harmonize_metadata_metadata.yaml
Edit harmonize_metadata/harmonize_metadata_metadata.yaml to re-run harmonization with corrected values.
This step can also be run on external AnnData files, which must include an obs['latch_sample_id'] column.
Requirements: The download/ folder must still exist with study_metadata.txt and paper_text.txt files.
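One check worth making while reviewing the harmonization report is that each harmonized variable really is defined at sample resolution. The sketch below assumes this means a single value per latch_sample_id. The function name is hypothetical, and the input is a plain list of per-cell dictionaries rather than any Latch Curate data structure.

```python
HARMONIZED_VARS = (
    "latch_subject_id", "latch_condition", "latch_disease", "latch_tissue",
    "latch_sample_site", "latch_sequencing_platform", "latch_organism",
)


def sample_resolution_violations(rows):
    """Return {(sample_id, variable): sorted values} where a sample has >1 value.

    rows: iterable of dicts, one per cell, keyed by obs column name.
    Each row must contain 'latch_sample_id'; other columns are optional.
    """
    seen = {}
    for row in rows:
        sample = row["latch_sample_id"]
        for var in HARMONIZED_VARS:
            if var in row:
                seen.setdefault((sample, var), set()).add(row[var])
    return {key: sorted(vals) for key, vals in seen.items() if len(vals) > 1}
```

An empty result means every sample carries exactly one value for each harmonized variable that is present.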
7. Publish
Build metadata and upload the curated dataset to the Latch Data Portal.
Requirements:
- Configuration files at ~/.latch/latch-curate/:
  - metadata_schema.yaml
  - cell_typing_schema.yaml
- Latch credentials at ~/.latch/:
  - token (from latch login)
  - workspace
Outputs:
- publish/build.yaml - Build metadata with extracted tags
- publish/publish.h5ad - Final curated object
Additional workflows:
- Run linting tests: Use the Lint Curated AnnData Workflow to validate object structure.
- Convert to Seurat: Use the AnnData To Seurat Conversion Workflow if Seurat format is required.
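Before running the Publish step, a quick check that the configuration and credential files are in place can save a failed run. This is a minimal sketch, not part of Latch Curate: the function name is hypothetical, and it assumes token and workspace are entries directly under ~/.latch/, as this guide lists them.

```python
from pathlib import Path

# Paths relative to the home directory, as listed in this guide.
REQUIRED_FILES = (
    ".latch/latch-curate/metadata_schema.yaml",
    ".latch/latch-curate/cell_typing_schema.yaml",
    ".latch/token",
    ".latch/workspace",
)


def missing_publish_files(home=None):
    """Return the required paths that do not exist under the home directory."""
    home = Path(home) if home is not None else Path.home()
    return [rel for rel in REQUIRED_FILES if not (home / rel).exists()]
```

Calling missing_publish_files() with no argument checks the current user's home directory; an empty list means all four entries were found.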
Best Practices
- Always review reports before proceeding to the next step
- Keep the original downloaded data intact
- Document any manual corrections made during the process
- Use the --use-params flag (for QC) and the --use-metadata flag (for type-cells and harmonize-metadata) to iterate on automated decisions
- Maintain separate directories for each dataset curation
Troubleshooting
- If the download review fails, reduce the size of text files by removing citations and detailed methods
- For construct-counts issues, use the chat command to understand what went wrong
- When QC seems too stringent or lenient, modify the parameters YAML and re-run with --use-params
- For incorrect cell type annotations, edit the metadata YAML and re-run with --use-metadata
- For incorrect metadata harmonization, edit the annotations YAML and re-run with --use-metadata