Engineering Principles

Latch Curate is built on carefully designed engineering principles that enable effective collaboration between language models and human curators. These principles were developed through manually curating ten million cells spanning roughly 200 datasets and covering more than 80 autoimmune indications.

LLM Engineering Principles

1. End-to-End Reasoning

As the performance of frontier models continues to improve, we hypothesize that curation systems built around end-to-end reasoning will scale more effectively than architectures that rigidly partition responsibilities and execution order among multiple sub-agents. Whenever possible, latch-curate embeds task context, control-flow decisions, and tool selection within a single model call rather than orchestrating an array of specialised models with fixed interaction patterns.
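
As a rough sketch of the contrast, the single-call pattern reduces to one prompt that carries the task context, current state, and tool catalogue, with the model choosing the next action itself; curate_step, llm_complete, and the prompt wording below are hypothetical, not latch-curate's actual interface:

import json

def curate_step(task_context, tools, state, llm_complete):
    # One end-to-end call: the prompt carries the task context, the available
    # tools, and the current state; the model decides the next action itself.
    prompt = (
        f"{task_context}\n"
        f"Available tools: {sorted(tools)}\n"
        f"Current state: {state}\n"
        'Reply with JSON: {"tool": "<name>", "args": {}}'
    )
    action = json.loads(llm_complete(prompt))  # single reasoning call, no sub-agent routing
    return tools[action["tool"]](**action["args"])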

2. Precise Validation Criteria

We define precise validation criteria to capture edge cases, especially in agentic loops where test results provide the only feedback signal. Each criterion is split into:
  • A natural-language description, which guides the agent
  • A code assertion, which formally verifies the output and provides clear error logs
Example validation criteria:
# Natural language: "the var index consists of Ensembl IDs"
import re
ensembl_pattern = re.compile(r"ENS[A-Z]*G\d{11}")  # illustrative pattern; the exact regex may differ

record_and_assert(validation_log,
    all(map(bool, map(ensembl_pattern.match, adata.var_names))),
    "var index are Ensembl IDs")

# Natural language: "var contains gene_symbols"
record_and_assert(validation_log,
    'gene_symbols' in adata.var.columns,
    "var contains gene_symbols")

3. Domain Knowledge as Prompts and Tools

To minimise novel reasoning per task, domain knowledge is pulled into prompts and reusable tool libraries. This focuses the model on genuine task variation, boosting accuracy while reducing runtime and cost. Tools are developed both by:
  • Hand-coding utilities during manual cleaning
  • Mining logs from earlier agentic runs to find recurring operations
Task prompts evolve in the same way, becoming living documents that record edge cases and pitfalls observed across months of cleaning.
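
For instance, a tool in such a library might wrap a recurring clean-up step; the helper below is hypothetical and assumes a symbol-to-Ensembl mapping has already been built elsewhere:

import pandas as pd

def standardize_var_index(adata, symbol_to_ensembl):
    # Keep the original symbols as a var column, then re-index var on Ensembl
    # IDs so the standard validation criteria hold downstream.
    adata.var["gene_symbols"] = adata.var_names
    adata.var_names = pd.Index(
        [symbol_to_ensembl.get(s, s) for s in adata.var_names], name="ensembl_id"
    )
    return adata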

4. Output Integration

To integrate model outputs with conventional software, the model:
  • Writes driver scripts to canonical paths
  • Emits JSON that conforms to fixed schemas
Paths and schemas are validated in code; failures trigger automatic retries with validation errors appended to the prompt.
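
A minimal sketch of that retry loop, assuming a jsonschema-style check and a hypothetical run_model callable, might look like:

import json
import jsonschema  # assumes the jsonschema package; any schema validator works

def generate_with_retries(run_model, prompt, schema, max_attempts=3):
    # Call the model, validate the JSON it emits, and retry with the
    # validation error appended to the prompt until the schema is satisfied.
    for _ in range(max_attempts):
        raw = run_model(prompt)
        try:
            output = json.loads(raw)
            jsonschema.validate(output, schema)
            return output
        except (json.JSONDecodeError, jsonschema.ValidationError) as err:
            prompt = f"{prompt}\n\nPrevious output failed validation: {err}"
    raise RuntimeError("Model output never satisfied the schema")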

5. Chain-of-Thought Traces

Requesting explicit chain-of-thought traces consistently improves reasoning accuracy and provides curators with an introspectable record of the model’s logic. These traces are embedded in the output JSON and surfaced in validation reports.
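
For example, a task's output JSON might pair the trace with the structured result; the field names here are illustrative rather than the framework's actual schema:

# Illustrative shape only; each task defines its own result fields.
output = {
    "reasoning": "step-by-step justification recorded by the model",
    "result": {"example_field": "example_value"},
}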

Curation Principles

1. Understanding the Assignment

Most of the engineering effort for this system went into deeply understanding the curation task and encoding that domain knowledge into prompts, tool libraries, and tests, rather than traditional software development. We manually curated ten million cells spanning roughly 200 datasets and covering more than 80 autoimmune indications to learn which parts of the problem were conserved and which truly varied. For several months, we delivered data weekly to a biotech company developing autoimmune therapies, incorporating rapid feedback from domain experts to refine the process. As the curated volume grew, our prompts, tools, and tests became more robust with exposure to diverse:
  • Sequencing technologies
  • File formats
  • Supplemental structures
  • Study designs
  • Downstream analytical needs
This iterative loop ensured the system met the quality bar and translational requirements of real data consumers.

2. Ontology-Driven Variables

Where possible, we relied on well-maintained ontologies with strong scientific backing to populate key variables:
  • MONDO for latch_disease
  • CL for latch_cell_type_lvl_1
  • UBERON for latch_tissue
  • EFO for latch_sequencing_platform
Ontology names and CURIE IDs were concatenated with slashes (e.g., “systemic sclerosis/MONDO:0005100”) to avoid column duplication. Variable scopes were set in collaboration with data consumers—detailed enough to capture study-wide nuance while remaining coarse enough to avoid ambiguities. Cell types, for example, stay at “level 1” (T cells, neutrophils, etc.), allowing users to filter atlases quickly or run specialised subtyping tools.
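
Populating such a column is then a simple concatenation; the mapping and the intermediate disease_name column below are illustrative:

import pandas as pd

# Illustrative mapping; in practice the CURIE comes from the ontology lookup step.
disease_to_curie = {"systemic sclerosis": "MONDO:0005100"}

obs = pd.DataFrame({"disease_name": ["systemic sclerosis"]})
obs["latch_disease"] = [
    f"{name}/{disease_to_curie[name]}" for name in obs["disease_name"]
]
# -> "systemic sclerosis/MONDO:0005100"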

3. Validation Artifacts

Creating concise validation artifacts—reports with before-and-after plots that give curators just enough information to make decisions—proved challenging. Running large, diverse datasets through the system and iterating with domain experts revealed which plots and metrics mattered most.
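
As one concrete example of such a panel, a report might juxtapose the per-cell count distribution before and after filtering; the metric and helper below are assumptions, not the system's exact plots:

import matplotlib.pyplot as plt
import numpy as np

def before_after_counts(counts_before, counts_after, path):
    # Side-by-side histograms of per-cell total counts, saved for the report.
    fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
    for ax, counts, label in zip(axes, [counts_before, counts_after], ["before", "after"]):
        ax.hist(np.log10(np.asarray(counts) + 1), bins=50)
        ax.set_title(f"Total counts per cell ({label})")
        ax.set_xlabel("log10(counts + 1)")
    fig.tight_layout()
    fig.savefig(path)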

4. Parallel Agentic Workflows

Human-in-the-loop efficiency scales when curators can juggle many agentic workflows simultaneously. A single task, such as count-matrix construction, may take 5–30 minutes before it needs human validation. Throughput peaks when enough concurrent runs keep the validation queue full. Ongoing work aims to streamline curator triage of agentic runs and to boost throughput by dispatching containerised tasks to workflow-orchestration software.
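
A rough sketch of that dispatch pattern, assuming each agentic run is exposed as a Python callable (run_task, dataset_ids, and the worker count are hypothetical):

from concurrent.futures import ThreadPoolExecutor, as_completed

def dispatch_runs(run_task, dataset_ids, max_workers=8):
    # Keep several runs in flight so the curator's validation queue stays full.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_task, ds): ds for ds in dataset_ids}
        for future in as_completed(futures):
            yield futures[future], future.result()  # (dataset_id, run output)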

Technical Implementation

Storage Standard

We adopted the Scanpy ecosystem and AnnData objects as our storage standard. Their Python-native design and widespread community support let us reuse tool libraries across agentic tasks and kept model-generated code readable.
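
In practice this means a curated object can be inspected with nothing beyond the usual ecosystem calls; the file name below is illustrative:

import anndata as ad

adata = ad.read_h5ad("curated_dataset.h5ad")  # illustrative path
print(adata.obs[["latch_disease", "latch_cell_type_lvl_1", "latch_tissue"]].head())
print(adata.var[["gene_symbols"]].head())     # var index holds Ensembl IDs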

Version Control

Each task outputs assets - driver scripts, JSON files, agent logs, and reports - into directories that can be uploaded to version-controlled blob stores. Because the agentic workflow runs inside a versioned container with input data mounted to a sandboxed file system at well-defined locations, rerunning these workflows with modified inputs or parameters is straightforward.

Reproducibility

Curated datasets are living assets, and new computational tools or updated scientific knowledge often require re-processing previously curated objects. The framework maintains complete reproducibility through:
  • Versioned containers
  • Fixed input/output paths
  • Comprehensive logging
  • Parameter files for each processing step
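
For illustration, a parameter file for a single step might record everything needed to replay it; the fields and values below are assumptions rather than the framework's actual schema:

import json

# Hypothetical parameter file for one processing step.
params = {
    "step": "count_matrix_construction",
    "container": "latch-curate:<version>",
    "inputs": ["/mnt/inputs/raw_counts.mtx"],
    "outputs": ["/mnt/outputs/adata_raw.h5ad"],
    "min_genes_per_cell": 200,
}
with open("params.json", "w") as fh:
    json.dump(params, fh, indent=2)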