Configuration Reference

Latch Curate uses two YAML configuration files to customize cell typing and metadata harmonization workflows. These files allow you to define custom vocabularies, ontologies, and validation rules for your datasets.

cell_typing_schema.yaml

The cell typing configuration defines the vocabulary and marker genes used for automated cell type annotation.
Location: ~/.latch/latch-curate/cell_typing_schema.yaml

Configuration Fields

cell_type_column

  • Type: string
  • Required: Yes
  • Description: Column name where cell type annotations will be stored in AnnData.obs
  • Default: "latch_cell_type_lvl_1"

cluster_column

  • Type: string
  • Required: Yes
  • Description: Name of the clustering column in AnnData.obs to use for cell typing
  • Default: "leiden_res_0.50"
  • Validation: Must exist in AnnData.obs

vocabulary

  • Type: list[object]
  • Required: Yes
  • Description: List of allowed cell types with Cell Ontology (CL) identifiers
Each vocabulary entry contains:
  • name (string): Human-readable cell type name
  • ontology_id (string): Cell Ontology ID in format "CL:XXXXXXX"

marker_genes

  • Type: dict[string, list[string]]
  • Required: Yes
  • Description: Mapping of cell type groups to lists of marker gene symbols
  • Keys: Cell type group names (can differ from vocabulary names)
  • Values: Lists of gene symbols
  • Validation: Emits a warning for each gene not found in AnnData.var['gene_symbols']

Example Configuration

cell_type_column: "latch_cell_type_lvl_1"

cluster_column: "leiden_res_0.50"

vocabulary:
  - name: "astrocyte"
    ontology_id: "CL:0000127"
  - name: "B cell"
    ontology_id: "CL:0000236"
  - name: "endothelial cell"
    ontology_id: "CL:0000115"
  - name: "T cell"
    ontology_id: "CL:0000084"

marker_genes:
  "T cell/NK cell":
    - "CD3D"
    - "CD3E"
    - "CD8A"
    - "CD4"
  "B cell/plasma cell":
    - "JCHAIN"
    - "CD19"
    - "MS4A1"
  "astrocyte":
    - "GFAP"
    - "AQP4"
  "endothelial cell":
    - "CDH5"
    - "VWF"
    - "PECAM1"

Validation Rules

  • The cluster_column must exist in the AnnData object
  • Cell types in the data should either match a vocabulary name or use the format "name/ontology_id"
  • Ontology IDs must follow Cell Ontology format (CL:XXXXXXX)
  • Missing marker genes generate warnings but don’t fail validation
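
The rules above can be sketched as a small pre-check on an already-parsed schema. This is only an illustration, not the validation code that latch-curate runs: check_cell_typing_schema is a hypothetical helper, the schema is passed in as a plain dict (as a YAML parser would produce), and the regular expression assumes that "CL:XXXXXXX" means "CL:" followed by exactly seven digits.

```python
import re

# Assumption: "CL:XXXXXXX" means "CL:" followed by exactly seven digits.
CL_ID_PATTERN = re.compile(r"^CL:\d{7}$")


def check_cell_typing_schema(schema, obs_columns, gene_symbols):
    """Return (errors, warnings) for a parsed cell typing schema.

    Hypothetical helper for illustration; not part of latch-curate.
    """
    errors = []
    warnings = []

    # The cluster column must exist in AnnData.obs.
    cluster_column = schema["cluster_column"]
    if cluster_column not in obs_columns:
        errors.append(f"cluster column {cluster_column!r} not found in obs")

    # Every vocabulary entry needs a well-formed Cell Ontology ID.
    for entry in schema["vocabulary"]:
        if not CL_ID_PATTERN.match(entry["ontology_id"]):
            errors.append(
                f"bad ontology ID for {entry['name']!r}: {entry['ontology_id']!r}"
            )

    # Missing marker genes produce warnings, not errors.
    known = set(gene_symbols)
    for group, genes in schema["marker_genes"].items():
        missing = [gene for gene in genes if gene not in known]
        if missing:
            warnings.append(f"{group}: marker genes not found: {', '.join(missing)}")

    return errors, warnings


schema = {
    "cell_type_column": "latch_cell_type_lvl_1",
    "cluster_column": "leiden_res_0.50",
    "vocabulary": [
        {"name": "astrocyte", "ontology_id": "CL:0000127"},
        {"name": "T cell", "ontology_id": "CL:84"},  # deliberately malformed
    ],
    "marker_genes": {"astrocyte": ["GFAP", "AQP4"]},
}

errors, warnings = check_cell_typing_schema(
    schema,
    obs_columns=["leiden_res_0.50"],
    gene_symbols=["GFAP", "CD3D"],
)
print(errors)
print(warnings)
```

In practice obs_columns and gene_symbols would come from adata.obs.columns and adata.var['gene_symbols']; plain lists are used here so the sketch runs without AnnData installed.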

Usage in Pipeline

The cell typing schema is used by:
  • latch-curate type-cells - Main cell typing workflow
  • latch-curate publish build - Validation during publication
If no custom configuration exists, the system falls back to default values defined in the codebase.

metadata_schema.yaml

The metadata schema defines harmonized metadata variables that should be extracted and validated against controlled vocabularies or ontologies.
Location: ~/.latch/latch-curate/metadata_schema.yaml

Configuration Fields

variables

  • Type: list[object]
  • Required: Yes
  • Description: List of metadata variable definitions
Each variable contains:
name
  • Type: string
  • Required: Yes
  • Description: Column name to create in AnnData.obs
  • Convention: Prefix with latch_ (e.g., "latch_disease", "latch_tissue")
description
  • Type: string
  • Required: Yes
  • Description: Natural language description of what the variable represents
  • Usage: Used by LLM to understand what metadata to extract
vocab
  • Type: object
  • Required: Yes
  • Description: Vocabulary specification defining allowed values
The vocab object contains:
vocab.type
  • Type: string
  • Required: Yes
  • Allowed Values:
    • "uncontrolled" - Free text, no validation
    • "ontology" - Must match terms from a specific ontology
    • "custom" - Must match predefined list of values
vocab.name
  • Type: string
  • Required: Required when type: "ontology"
  • Allowed Values:
    • "mondo" - Disease ontology
    • "uberon" - Tissue/anatomy ontology
    • "cl" - Cell type ontology
    • "efo" - Experimental Factor Ontology (sequencing platforms)
vocab.values
  • Type: list[string]
  • Required: Required when type: "custom"
  • Description: List of allowed values for custom vocabularies

Example Configuration

variables:
  - name: "latch_subject_id"
    description: "unique patient subjects"
    vocab:
      type: "uncontrolled"

  - name: "latch_disease"
    description: "disease"
    vocab:
      type: "ontology"
      name: "mondo"

  - name: "latch_tissue"
    description: "tissue or anatomical site"
    vocab:
      type: "ontology"
      name: "uberon"

  - name: "latch_sample_site"
    description: "sample site"
    vocab:
      type: "custom"
      values:
        - "lesional"
        - "peri-lesional"
        - "normal"
        - "blood"
        - "in vitro"

  - name: "latch_sequencing_platform"
    description: "sequencing platform used"
    vocab:
      type: "ontology"
      name: "efo"

  - name: "latch_organism"
    description: "organism"
    vocab:
      type: "custom"
      values:
        - "homo sapiens"
        - "mus musculus"

Validation Rules

  • Ontology terms must be in format: "name/ONTOLOGY_ID" (e.g., "systemic sclerosis/MONDO:0005100")
  • Custom vocabulary values must exactly match one of the allowed values (case-sensitive)
  • Uncontrolled fields cannot be empty
  • All variables defined in the schema will be created as columns in the AnnData object
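
As a rough illustration of how the three vocab types differ, the per-value rules above could be expressed as follows. validate_value is a hypothetical helper, not a latch-curate function. Two assumptions are made that the rules do not state: the ontology prefix is the upper-cased vocab.name (for example "mondo" becomes "MONDO"), and the part after the colon is numeric. The sketch checks format only; it does not confirm that a term actually exists in the ontology.

```python
import re


def validate_value(value, vocab):
    """Return True if `value` satisfies the vocab spec.

    Hypothetical helper for illustration; not part of latch-curate.
    """
    kind = vocab["type"]

    if kind == "uncontrolled":
        # Free text, but it cannot be empty.
        return bool(value.strip())

    if kind == "custom":
        # Exact, case-sensitive match against the allowed values.
        return value in vocab["values"]

    if kind == "ontology":
        # Expected shape: "name/ONTOLOGY_ID".
        # Assumption: the ID prefix is vocab["name"] upper-cased, followed by digits.
        prefix = vocab["name"].upper()
        name, _, term_id = value.rpartition("/")
        id_ok = re.fullmatch(rf"{re.escape(prefix)}:\d+", term_id) is not None
        return bool(name) and id_ok

    raise ValueError(f"unknown vocab type: {kind!r}")


mondo = {"type": "ontology", "name": "mondo"}
site = {"type": "custom", "values": ["lesional", "peri-lesional", "normal"]}
free = {"type": "uncontrolled"}

print(validate_value("systemic sclerosis/MONDO:0005100", mondo))  # well-formed term
print(validate_value("systemic sclerosis", mondo))                # missing ID
print(validate_value("Lesional", site))                           # wrong case
print(validate_value("", free))                                   # empty text
```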

Output Format

The harmonization process writes its results to harmonize_metadata/harmonize_metadata_metadata.yaml:
latch_disease:
  annotations:
    SAMPLE_001: "systemic sclerosis/MONDO:0005100"
    SAMPLE_002: "systemic sclerosis/MONDO:0005100"
  reasoning: "Based on the paper abstract..."

latch_tissue:
  annotations:
    SAMPLE_001: "skin/UBERON:0002097"
    SAMPLE_002: "skin/UBERON:0002097"
  reasoning: "Study metadata indicates..."
This file can be manually edited to correct errors, then re-applied using latch-curate harmonize-metadata run --use-metadata.
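
To make the structure of this file concrete, the sketch below expands the per-sample annotations into one value per cell, keyed by each cell's sample ID. It illustrates the data layout only and is not how latch-curate applies the file: expand_annotations is a hypothetical helper, and the cache is written as a Python dict with the same shape a YAML parser would return for the example above.

```python
def expand_annotations(cache, sample_ids):
    """Map per-sample annotations onto a per-cell list for each variable.

    `cache` mirrors the YAML file: {variable: {"annotations": {...}, "reasoning": ...}}.
    `sample_ids` holds one sample ID per cell (as in obs['latch_sample_id']).
    Hypothetical helper for illustration; not part of latch-curate.
    """
    columns = {}
    for variable, entry in cache.items():
        per_sample = entry["annotations"]
        # Cells whose sample has no annotation get None.
        columns[variable] = [per_sample.get(sample) for sample in sample_ids]
    return columns


cache = {
    "latch_disease": {
        "annotations": {
            "SAMPLE_001": "systemic sclerosis/MONDO:0005100",
            "SAMPLE_002": "systemic sclerosis/MONDO:0005100",
        },
        "reasoning": "Based on the paper abstract...",
    },
    "latch_tissue": {
        "annotations": {
            "SAMPLE_001": "skin/UBERON:0002097",
            "SAMPLE_002": "skin/UBERON:0002097",
        },
        "reasoning": "Study metadata indicates...",
    },
}

# Three cells: two from SAMPLE_001, one from SAMPLE_002.
cell_sample_ids = ["SAMPLE_001", "SAMPLE_001", "SAMPLE_002"]
columns = expand_annotations(cache, cell_sample_ids)
print(columns["latch_tissue"])
```

Because each annotation is keyed by sample ID, correcting a single entry in the file changes the value for every cell belonging to that sample.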

Usage in Pipeline

The metadata schema is used by:
  • latch-curate harmonize-metadata run - LLM-based metadata extraction
  • latch-curate publish build - Tag extraction and validation
  • latch-curate lint - Metadata validation
The LLM receives the variable definitions and has access to ontology search tools to find matching terms. Results are written to both the AnnData object and a YAML cache file for review and correction.

Using with External Data

The harmonize-metadata command can work with any AnnData file using the --adata-path flag:
latch-curate harmonize-metadata run --adata-path /path/to/your_data.h5ad
Requirements:
  • AnnData object must have obs['latch_sample_id'] column with sample identifiers
  • The download/ folder must exist with study_metadata.txt and paper_text.txt files
  • Metadata schema must be configured at ~/.latch/latch-curate/metadata_schema.yaml
Example workflow for ATAC-seq:
# 1. Ensure your ATAC-seq AnnData has sample IDs
python3 -c "
import scanpy as sc
adata = sc.read_h5ad('atac_data.h5ad')
adata.obs['latch_sample_id'] = adata.obs['sample_name']
adata.write('atac_data.h5ad')
"

# 2. Create download folder with metadata
mkdir -p download
echo 'Study metadata here...' > download/study_metadata.txt
echo 'Paper abstract here...' > download/paper_text.txt

# 3. Run harmonization
latch-curate harmonize-metadata run --adata-path atac_data.h5ad
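
The file-based requirements listed above can be checked before running the command. The following is a minimal sketch under stated assumptions: missing_inputs is a hypothetical helper, not part of latch-curate, and it assumes the download/ folder is looked up relative to the working directory. It does not cover the obs['latch_sample_id'] requirement, which needs the AnnData file to be opened.

```python
import tempfile
from pathlib import Path

# File names and schema location are taken from the requirements above.
REQUIRED_DOWNLOAD_FILES = ("study_metadata.txt", "paper_text.txt")
DEFAULT_SCHEMA = "~/.latch/latch-curate/metadata_schema.yaml"


def missing_inputs(workdir, schema_path=DEFAULT_SCHEMA):
    """Return the paths of required inputs that do not exist.

    Hypothetical helper for illustration; not part of latch-curate.
    """
    missing = []
    download = Path(workdir) / "download"
    for name in REQUIRED_DOWNLOAD_FILES:
        if not (download / name).is_file():
            missing.append(str(download / name))

    schema = Path(schema_path).expanduser()
    if not schema.is_file():
        missing.append(str(schema))
    return missing


# Demo in a throwaway directory: paper_text.txt is deliberately left out.
with tempfile.TemporaryDirectory() as tmp:
    download = Path(tmp) / "download"
    download.mkdir()
    (download / "study_metadata.txt").write_text("Study metadata here...")
    schema_file = Path(tmp) / "metadata_schema.yaml"
    schema_file.write_text("variables: []\n")

    result = missing_inputs(tmp, schema_file)
    expected_missing = str(download / "paper_text.txt")

print(result)
```

An empty list means the file-based requirements are met.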