Latch Curate is built on carefully designed engineering principles that enable effective collaboration between language models and human curators. These principles were developed through manually curating ten million cells spanning roughly 200 datasets and covering more than 80 autoimmune indications.
As the performance of frontier models continues to improve, we hypothesize that curation systems built around end-to-end reasoning will scale more effectively than architectures that rigidly partition roles and ordering among multiple sub-agents. Whenever possible, latch-curate embeds task context, control-flow decisions, and tool selection within a single model call rather than orchestrating an array of specialised models with fixed interaction patterns.
We define precise validation criteria to capture edge cases, especially in agentic loops where test results provide the only feedback signal. Each criterion is split into:
A natural-language description, which guides the agent
A code assertion, which formally verifies the output and provides clear error logs
Example validation criteria:
# Natural language: "the var index consists of Ensembl IDs"
record_and_assert(validation_log, all(map(bool, map(ensembl_pattern.match, adata.var_names))), "var index are Ensembl IDs")

# Natural language: "var contains gene_symbols"
record_and_assert(validation_log, 'gene_symbols' in adata.var.columns, "var contains gene_symbols")
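Neither record_and_assert nor ensembl_pattern is defined in the snippet above; a minimal sketch of what they could look like, assuming validation_log is a plain list of results, is:

import re

# Assumed pattern for human Ensembl gene IDs (illustrative only)
ensembl_pattern = re.compile(r"^ENSG\d{11}")

def record_and_assert(validation_log: list, passed: bool, criterion: str) -> None:
    # Sketch: record the outcome of one criterion, then fail loudly so the
    # agentic loop receives a clear error message when the assertion does not hold.
    validation_log.append({"criterion": criterion, "passed": bool(passed)})
    assert passed, f"validation failed: {criterion}"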
To minimise novel reasoning per task, domain knowledge is pulled into prompts and reusable tool libraries. This focuses the model on genuine task variation, boosting accuracy while reducing runtime and cost. Tools are developed in two ways (see the example utility after this list):
Hand-coding utilities during manual cleaning
Mining logs from earlier agentic runs to find recurring operations
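As an illustration, a typical hand-coded utility might standardise a var index to Ensembl IDs given a symbol-to-ID mapping. The function below is a hypothetical example, not part of the actual library:

import anndata as ad
import pandas as pd

def set_var_index_to_ensembl(adata: ad.AnnData, symbol_to_ensembl: dict[str, str]) -> ad.AnnData:
    # Hypothetical utility: keep mapped genes, preserve symbols in var['gene_symbols'],
    # and reindex var on Ensembl IDs.
    adata = adata[:, adata.var_names.isin(set(symbol_to_ensembl))].copy()
    adata.var["gene_symbols"] = adata.var_names
    adata.var_names = pd.Index([symbol_to_ensembl[s] for s in adata.var_names], name="ensembl_id")
    return adata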
Task prompts evolve in the same way, becoming living documents that record edge cases and pitfalls observed across months of cleaning.
Requesting explicit chain-of-thought traces consistently improves reasoning accuracy and provides curators with an introspectable record of the model’s logic. These traces are embedded in the output JSON and surfaced in validation reports.
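The exact schema varies by task; an illustrative shape (field names here are hypothetical) for an output that carries the trace alongside the structured answer:

import json

# Hypothetical output shape; the real schema is task-specific
task_output = {
    "reasoning": "The 'condition' column mixes disease and treatment labels; the disease maps to a MONDO term ...",
    "answer": {"latch_disease": "systemic sclerosis/MONDO:0005100"},
}
print(json.dumps(task_output, indent=2))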
Most of the engineering effort for this system went into deeply understanding the curation task and encoding that domain knowledge into prompts, tool libraries, and tests, rather than traditional software development. We manually curated ten million cells spanning roughly 200 datasets and covering more than 80 autoimmune indications to learn which parts of the problem were conserved and which truly varied. For several months, we delivered data weekly to a biotech company developing autoimmune therapies, incorporating rapid feedback from domain experts to refine the process. As the curated volume grew, our prompts, tools, and tests became more robust with exposure to diverse:
Sequencing technologies
File formats
Supplemental structures
Study designs
Downstream analytical needs
This iterative loop ensured the system met the quality bar and translational requirements of real data consumers.
Where possible, we relied on well-maintained ontologies with strong scientific backing to populate key variables:
MONDO for latch_disease
CL for latch_cell_type_lvl_1
UBERON for latch_tissue
EFO for latch_sequencing_platform
Ontology names and CURIE IDs were concatenated with slashes (e.g., “systemic sclerosis/MONDO:0005100”) so a single column carries both the human-readable label and the identifier, avoiding duplicated columns. Variable scopes were set in collaboration with data consumers—detailed enough to capture study-wide nuance while remaining coarse enough to avoid ambiguities. Cell types, for example, stay at “level 1” (T cells, neutrophils, etc.), allowing users to filter atlases quickly or run specialised subtyping tools.
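A minimal sketch of how such combined values might be composed and checked (the helper and pattern are illustrative, not part of latch-curate):

import re

# Illustrative pattern for "label/CURIE" values such as "systemic sclerosis/MONDO:0005100"
ONTOLOGY_VALUE = re.compile(r"^.+/(MONDO|CL|UBERON|EFO):\d+$")

def make_ontology_value(label: str, curie: str) -> str:
    # Concatenate a human-readable label and its CURIE into a single column value
    return f"{label}/{curie}"

assert ONTOLOGY_VALUE.match(make_ontology_value("systemic sclerosis", "MONDO:0005100"))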
Creating concise validation artifacts—reports with before-and-after plots that give curators just enough information to make decisions—proved challenging. Running large, diverse datasets through the system and iterating with domain experts revealed which plots and metrics mattered most.
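As an illustration of the kind of plot involved, a before-and-after QC violin could be generated as follows (a sketch only; the file names and parameter choices are assumptions, not the system's actual report code):

import scanpy as sc

adata_raw = sc.read_h5ad("before_cleaning.h5ad")    # hypothetical paths
adata_clean = sc.read_h5ad("after_cleaning.h5ad")

for label, adata in [("before", adata_raw), ("after", adata_clean)]:
    # Compute standard per-cell QC metrics, then plot them side by side
    sc.pp.calculate_qc_metrics(adata, percent_top=None, log1p=False, inplace=True)
    sc.pl.violin(
        adata,
        ["n_genes_by_counts", "total_counts"],
        jitter=0.4,
        save=f"_qc_{label}.png",  # written into scanpy's figure directory
    )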
Human-in-the-loop efficiency scales when curators can juggle many agentic workflows simultaneously. A single task, such as count-matrix construction, may take 5–30 minutes before it needs human validation. Throughput peaks when enough concurrent runs keep the validation queue full. Ongoing work aims to streamline curator triage of agentic runs and to boost throughput by dispatching containerised tasks to workflow-orchestration software.
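A rough back-of-the-envelope model makes the scaling concrete (the numbers below are illustrative assumptions, not measurements):

# Illustrative arithmetic: how many concurrent runs keep one curator busy
agent_minutes_per_task = 20   # unattended agent time before validation (within the 5-30 min range)
review_minutes_per_task = 3   # hypothetical curator time to review one validation report

concurrent_runs = agent_minutes_per_task / review_minutes_per_task
print(f"~{concurrent_runs:.0f} concurrent runs keep one curator's queue full")  # ~7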
We adopted the Scanpy ecosystem and AnnData objects as our storage standard. Their Python-native design and widespread community support let us reuse tool libraries across agentic tasks and kept model-generated code readable.
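For readers unfamiliar with the format, an AnnData object bundles the count matrix with cell-level (obs) and gene-level (var) metadata; a minimal illustration (the metadata values echo the latch_* variables above but are otherwise arbitrary):

import anndata as ad
import numpy as np
import pandas as pd

# Minimal AnnData layout: matrix X plus obs (cells) and var (genes) tables
adata = ad.AnnData(
    X=np.zeros((3, 2)),
    obs=pd.DataFrame(
        {"latch_disease": ["systemic sclerosis/MONDO:0005100"] * 3},
        index=[f"cell_{i}" for i in range(3)],
    ),
    var=pd.DataFrame(
        {"gene_symbols": ["TP53", "CD4"]},
        index=["ENSG00000141510", "ENSG00000010610"],
    ),
)
print(adata)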
Each task outputs assets (driver scripts, JSON files, agent logs, and reports) into directories that can be uploaded to version-controlled blob stores. Because the agentic workflow runs inside a versioned container with input data mounted to a sandboxed file system at well-defined locations, rerunning these workflows with modified information is straightforward.
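A hypothetical layout for one task's output directory (the specific file names are illustrative, not the tool's actual conventions):

from pathlib import Path

def task_assets(task_dir: Path) -> dict[str, Path]:
    # Hypothetical mapping from asset type to its location in a task's output directory
    return {
        "driver_script": task_dir / "driver.py",        # the exact script the agent executed
        "structured_output": task_dir / "output.json",  # answers plus chain-of-thought trace
        "agent_log": task_dir / "agent.log",            # full message and tool-call history
        "validation_report": task_dir / "report.html",  # plots and assertions for curator review
    }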
Curated datasets are living assets, and new computational tools or updated scientific knowledge often require re-processing previously curated objects. The framework maintains complete reproducibility through: