# Configuration Reference

> Configuration files for cell typing and metadata harmonization

Latch Curate reads two YAML configuration files: `cell_typing_schema.yaml` for cell typing and `metadata_schema.yaml` for metadata harmonization. Use them to define the vocabularies, ontologies, and validation rules applied to your datasets.

## cell\_typing\_schema.yaml

The cell typing configuration defines the vocabulary and marker genes used for automated cell type annotation.

**Location**: `~/.latch/latch-curate/cell_typing_schema.yaml`

### Configuration Fields

#### cell\_type\_column

* **Type**: `string`
* **Required**: Yes
* **Description**: Column name where cell type annotations will be stored in `AnnData.obs`
* **Default**: `"latch_cell_type_lvl_1"`

#### cluster\_column

* **Type**: `string`
* **Required**: Yes
* **Description**: Name of the clustering column in `AnnData.obs` to use for cell typing
* **Default**: `"leiden_res_0.50"`
* **Validation**: Must exist in `AnnData.obs`

#### vocabulary

* **Type**: `list[object]`
* **Required**: Yes
* **Description**: List of allowed cell types with Cell Ontology (CL) identifiers

Each vocabulary entry contains:

* **name** (`string`): Human-readable cell type name
* **ontology\_id** (`string`): Cell Ontology ID in format `"CL:XXXXXXX"`

#### marker\_genes

* **Type**: `dict[string, list[string]]`
* **Required**: Yes
* **Description**: Mapping of cell type groups to lists of marker gene symbols
* **Keys**: Cell type group names (can differ from vocabulary names)
* **Values**: Lists of gene symbols
* **Validation**: Emits a warning for each gene not found in `AnnData.var['gene_symbols']`
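
The warn-only behavior for marker genes can be illustrated with a short sketch. It is not the tool's own implementation: the helper names `find_missing_markers` and `warn_on_missing_markers` are hypothetical, the gene list stands in for `list(adata.var['gene_symbols'])`, and exact, case-sensitive symbol matching is an assumption.

```python
import warnings


def find_missing_markers(marker_genes, gene_symbols):
    """Return {group: [genes absent from gene_symbols]} for groups with gaps."""
    known = set(gene_symbols)
    missing = {}
    for group, genes in marker_genes.items():
        absent = [g for g in genes if g not in known]
        if absent:
            missing[group] = absent
    return missing


def warn_on_missing_markers(marker_genes, gene_symbols):
    """Warn once per group instead of raising, so validation still passes."""
    missing = find_missing_markers(marker_genes, gene_symbols)
    for group, genes in missing.items():
        warnings.warn(f"{group}: marker genes not found: {', '.join(genes)}")
    return missing


marker_genes = {
    "T cell/NK cell": ["CD3D", "CD3E", "CD8A", "CD4"],
    "astrocyte": ["GFAP", "AQP4"],
}
# Made-up stand-in for list(adata.var["gene_symbols"]).
gene_symbols = ["CD3D", "CD3E", "CD4", "GFAP", "AQP4"]
print(find_missing_markers(marker_genes, gene_symbols))
```

Running a check like this before `latch-curate type-cells` can surface symbols that are misspelled or absent from the dataset.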

### Example Configuration

```yaml
cell_type_column: "latch_cell_type_lvl_1"

cluster_column: "leiden_res_0.50"

vocabulary:
  - name: "astrocyte"
    ontology_id: "CL:0000127"
  - name: "B cell"
    ontology_id: "CL:0000236"
  - name: "endothelial cell"
    ontology_id: "CL:0000115"
  - name: "T cell"
    ontology_id: "CL:0000084"

marker_genes:
  "T cell/NK cell":
    - "CD3D"
    - "CD3E"
    - "CD8A"
    - "CD4"
  "B cell/plasma cell":
    - "JCHAIN"
    - "CD19"
    - "MS4A1"
  "astrocyte":
    - "GFAP"
    - "AQP4"
  "endothelial cell":
    - "CDH5"
    - "VWF"
    - "PECAM1"
```

### Validation Rules

* The `cluster_column` must exist in `AnnData.obs`
* Cell type labels in the data should either match a vocabulary `name` or use the format `"name/ontology_id"` (e.g., `"T cell/CL:0000084"`)
* Ontology IDs must follow Cell Ontology format (`CL:XXXXXXX`)
* Missing marker genes generate warnings but don't fail validation
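
A minimal sketch of the label and ontology ID rules, assuming the configuration has already been parsed into Python dicts. The function names are hypothetical, and reading `CL:XXXXXXX` as "`CL:` followed by exactly seven digits" is an assumption based on the examples in this page.

```python
import re

# Assumption: "CL:XXXXXXX" means the prefix "CL:" plus exactly seven digits.
CL_ID = re.compile(r"CL:\d{7}")


def check_vocabulary(vocabulary):
    """Return a list of problems found in the vocabulary entries."""
    problems = []
    for entry in vocabulary:
        ontology_id = str(entry.get("ontology_id", ""))
        if not CL_ID.fullmatch(ontology_id):
            problems.append(f"bad ontology_id for {entry.get('name')!r}")
    return problems


def label_is_allowed(label, vocabulary):
    """Accept a bare vocabulary name or the 'name/ontology_id' form."""
    names = {e["name"] for e in vocabulary}
    pairs = {f"{e['name']}/{e['ontology_id']}" for e in vocabulary}
    return label in names or label in pairs


vocabulary = [
    {"name": "T cell", "ontology_id": "CL:0000084"},
    {"name": "B cell", "ontology_id": "CL:0000236"},
]
print(check_vocabulary(vocabulary))
print(label_is_allowed("T cell/CL:0000084", vocabulary))
```

Checking the combined form against exact `name/ontology_id` pairs also rejects labels that pair a valid name with another cell type's ID.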

### Usage in Pipeline

The cell typing schema is used by:

* `latch-curate type-cells` - Main cell typing workflow
* `latch-curate publish build` - Validation during publication

If no configuration file exists at this location, the system falls back to the default values defined in the codebase.

## metadata\_schema.yaml

The metadata schema defines harmonized metadata variables that should be extracted and validated against controlled vocabularies or ontologies.

**Location**: `~/.latch/latch-curate/metadata_schema.yaml`

### Configuration Fields

#### variables

* **Type**: `list[object]`
* **Required**: Yes
* **Description**: List of metadata variable definitions

Each variable contains:

##### name

* **Type**: `string`
* **Required**: Yes
* **Description**: Column name to create in `AnnData.obs`
* **Convention**: Prefix with `latch_` (e.g., `"latch_disease"`, `"latch_tissue"`)

##### description

* **Type**: `string`
* **Required**: Yes
* **Description**: Natural language description of what the variable represents
* **Usage**: Passed to the LLM so it knows what metadata to extract

##### vocab

* **Type**: `object`
* **Required**: Yes
* **Description**: Vocabulary specification defining allowed values

The `vocab` object contains:

###### vocab.type

* **Type**: `string`
* **Required**: Yes
* **Allowed Values**:
  * `"uncontrolled"` - Free text, not checked against a vocabulary (but must not be empty)
  * `"ontology"` - Must match terms from a specific ontology
  * `"custom"` - Must match predefined list of values

###### vocab.name

* **Type**: `string`
* **Required**: Required when `type: "ontology"`
* **Allowed Values**:
  * `"mondo"` - Disease ontology
  * `"uberon"` - Tissue/anatomy ontology
  * `"cl"` - Cell type ontology
  * `"efo"` - Experimental Factor Ontology (sequencing platforms)

###### vocab.values

* **Type**: `list[string]`
* **Required**: Required when `type: "custom"`
* **Description**: List of allowed values for custom vocabularies
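
The conditional requirements above (`vocab.name` for ontologies, `vocab.values` for custom vocabularies) can be checked with a sketch like the following. It assumes each variable has already been parsed into a dict; `check_variable` is a hypothetical helper, not part of `latch-curate`.

```python
VOCAB_TYPES = {"uncontrolled", "ontology", "custom"}
ONTOLOGIES = {"mondo", "uberon", "cl", "efo"}


def check_variable(var):
    """Return a list of problems for one entry of `variables`."""
    problems = []
    for key in ("name", "description", "vocab"):
        if key not in var:
            problems.append(f"missing field: {key}")
    vocab = var.get("vocab") or {}
    vtype = vocab.get("type")
    if vtype not in VOCAB_TYPES:
        problems.append(f"unknown vocab.type: {vtype!r}")
    elif vtype == "ontology" and vocab.get("name") not in ONTOLOGIES:
        problems.append(f"unknown vocab.name: {vocab.get('name')!r}")
    elif vtype == "custom" and not vocab.get("values"):
        problems.append("vocab.values is required for custom vocabularies")
    return problems


variable = {
    "name": "latch_sample_site",
    "description": "sample site",
    "vocab": {"type": "custom"},  # values intentionally left out
}
print(check_variable(variable))
```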

### Example Configuration

```yaml
variables:
  - name: "latch_subject_id"
    description: "unique patient subjects"
    vocab:
      type: "uncontrolled"

  - name: "latch_disease"
    description: "disease"
    vocab:
      type: "ontology"
      name: "mondo"

  - name: "latch_tissue"
    description: "tissue or anatomical site"
    vocab:
      type: "ontology"
      name: "uberon"

  - name: "latch_sample_site"
    description: "sample site"
    vocab:
      type: "custom"
      values:
        - "lesional"
        - "peri-lesional"
        - "normal"
        - "blood"
        - "in vitro"

  - name: "latch_sequencing_platform"
    description: "sequencing platform used"
    vocab:
      type: "ontology"
      name: "efo"

  - name: "latch_organism"
    description: "organism"
    vocab:
      type: "custom"
      values:
        - "homo sapiens"
        - "mus musculus"
```

### Validation Rules

* Ontology terms must be in format: `"name/ONTOLOGY_ID"` (e.g., `"systemic sclerosis/MONDO:0005100"`)
* Custom vocabulary values must exactly match one of the allowed values (case-sensitive)
* Uncontrolled fields cannot be empty
* Every variable defined in the schema is created as a column in `AnnData.obs`
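
The per-type rules above can be sketched as a single check. This is an illustration under stated assumptions rather than the validator used by `latch-curate`: `value_is_valid` is a hypothetical name, and the expected ID prefix is assumed to be the uppercase form of `vocab.name` (for example `mondo` gives `MONDO`), which matches the examples in this page but is not confirmed for every ontology.

```python
import re

# Assumed shape of an ontology value: "<label>/<PREFIX>:<digits>".
TERM = re.compile(r"(?P<label>.+)/(?P<prefix>[A-Za-z]+):(?P<id>\d+)")


def value_is_valid(value, vocab):
    """Check one harmonized value against a variable's vocab specification."""
    if vocab["type"] == "uncontrolled":
        # Free text, but it must not be empty.
        return bool(str(value).strip())
    if vocab["type"] == "custom":
        # Exact, case-sensitive match against the allowed values.
        return value in vocab["values"]
    if vocab["type"] == "ontology":
        match = TERM.fullmatch(str(value))
        return bool(match) and match.group("prefix") == vocab["name"].upper()
    return False


mondo = {"type": "ontology", "name": "mondo"}
print(value_is_valid("systemic sclerosis/MONDO:0005100", mondo))
print(value_is_valid("systemic sclerosis", mondo))
```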

### Output Format

The harmonization process creates:

**File**: `harmonize_metadata/harmonize_metadata_metadata.yaml`

```yaml
latch_disease:
  annotations:
    SAMPLE_001: "systemic sclerosis/MONDO:0005100"
    SAMPLE_002: "systemic sclerosis/MONDO:0005100"
  reasoning: "Based on the paper abstract..."

latch_tissue:
  annotations:
    SAMPLE_001: "skin/UBERON:0002097"
    SAMPLE_002: "skin/UBERON:0002097"
  reasoning: "Study metadata indicates..."
```

This file can be manually edited to correct errors, then re-applied using `latch-curate harmonize-metadata run --use-metadata`.
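
To make the structure of this file concrete, the sketch below expands per-sample annotations into one value per cell. It only illustrates the data shape and does not describe how `latch-curate` applies the file internally. `expand_annotations` is a hypothetical helper, `harmonized` stands for the parsed YAML, and `sample_ids` stands for `list(adata.obs['latch_sample_id'])`.

```python
def expand_annotations(harmonized, sample_ids):
    """Map per-sample annotations onto per-cell values.

    harmonized: dict of variable -> {"annotations": {...}, "reasoning": "..."}.
    sample_ids: one sample ID per cell.
    Returns {variable: [value per cell]}; raises KeyError if a sample
    has no annotation for some variable.
    """
    columns = {}
    for variable, entry in harmonized.items():
        per_sample = entry["annotations"]
        columns[variable] = [per_sample[s] for s in sample_ids]
    return columns


harmonized = {
    "latch_tissue": {
        "annotations": {
            "SAMPLE_001": "skin/UBERON:0002097",
            "SAMPLE_002": "skin/UBERON:0002097",
        },
        "reasoning": "Study metadata indicates...",
    }
}
sample_ids = ["SAMPLE_001", "SAMPLE_001", "SAMPLE_002"]
print(expand_annotations(harmonized, sample_ids))
```

The `KeyError` on an unannotated sample is a deliberate choice in this sketch: after manual edits, a missing or misspelled sample ID fails loudly instead of leaving cells unlabeled.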

### Usage in Pipeline

The metadata schema is used by:

* `latch-curate harmonize-metadata run` - LLM-based metadata extraction
* `latch-curate publish build` - Tag extraction and validation
* `latch-curate lint` - Metadata validation

The LLM receives the variable definitions and has access to ontology search tools to find matching terms. Results are written to both the AnnData object and a YAML cache file for review and correction.

### Using with External Data

The `harmonize-metadata` command can run on any AnnData (`.h5ad`) file passed with the `--adata-path` flag:

```bash
latch-curate harmonize-metadata run --adata-path /path/to/your_data.h5ad
```

**Requirements:**

* The AnnData object must have an `obs['latch_sample_id']` column containing sample identifiers
* The `download/` folder must exist and contain `study_metadata.txt` and `paper_text.txt`
* The metadata schema must be configured at `~/.latch/latch-curate/metadata_schema.yaml`
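
The file-based requirements can be checked before running the command. The following is a minimal sketch, assuming `download/` is resolved relative to the directory where the command is run. `missing_inputs` is a hypothetical helper, and it does not inspect the AnnData file itself.

```python
from pathlib import Path


def missing_inputs(workdir, schema_path=None):
    """Return the required files that are absent, as strings."""
    workdir = Path(workdir)
    if schema_path is None:
        # Location given in this page for the metadata schema.
        schema_path = Path.home() / ".latch" / "latch-curate" / "metadata_schema.yaml"
    required = [
        workdir / "download" / "study_metadata.txt",
        workdir / "download" / "paper_text.txt",
        Path(schema_path),
    ]
    return [str(p) for p in required if not p.is_file()]


if __name__ == "__main__":
    for path in missing_inputs("."):
        print(f"missing: {path}")
```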

**Example workflow for ATAC-seq:**

```bash
# 1. Ensure your ATAC-seq AnnData has sample IDs
python3 -c "
import scanpy as sc
adata = sc.read_h5ad('atac_data.h5ad')
adata.obs['latch_sample_id'] = adata.obs['sample_name']
adata.write('atac_data.h5ad')
"

# 2. Create download folder with metadata
mkdir -p download
echo 'Study metadata here...' > download/study_metadata.txt
echo 'Paper abstract here...' > download/paper_text.txt

# 3. Run harmonization
latch-curate harmonize-metadata run --adata-path atac_data.h5ad
```
