
Cross-Domain Optimisation Evaluation: Generalisation Under Pressure


A system that performs well in a single domain is not the same as a system that can be trusted across all of them. Most AI research validates in controlled, homogeneous conditions. Most production environments are neither. The question this research addresses is simple to state and genuinely difficult to answer: does the approach hold when the landscape changes beneath it?

In clinical informatics, that question is not academic. It is the difference between a system that finds the patients who need to be found and one that misses them quietly, at scale, across years of accumulated records.

The Problem

Specialisation is a hidden fragility. An AI system trained and evaluated on a single data type (imaging, structured records, or clinical text) has been tested on the data it was designed for. What has not been tested is whether it degrades gracefully, or at all predictably, when the data it encounters in production is different in character from the data it learned from.

In most domains this is a performance concern. In clinical environments it is a patient safety concern. A rare genetic disorder does not present itself tidily in one data modality. The evidence is distributed: fragments in clinical notes written by different practitioners over years, values in structured test results, patterns in imaging, absences as much as presences. A system that can reason fluently within one of these modalities and poorly across the others will miss signals not because it is wrong, but because it is incomplete.

The data problem compounds this. Rare conditions are rare by definition. Training data is scarce, unevenly distributed, and in clinical settings almost never shareable across institutions. A system that requires abundant, clean, homogeneous data to perform reliably is a system that cannot be deployed where the need is greatest.


The Experiment

The evaluation is designed to test generalisation directly, by constructing a problem that cannot be solved by domain-specific optimisation alone. The system is evaluated across interleaved, divergent data landscapes: medical imaging under realistic noise conditions, structured patient data, and unstructured clinical text including reporting summaries. These are not similar problems presented sequentially. They are genuinely different problems, with different data types, different noise characteristics, and different analytical demands, presented in combination.

Stability across this combination is the primary result of interest. Not peak performance in any single domain, but consistent, reliable behaviour when the problem shifts character between evaluations. A system that holds its footing across divergent landscapes can be trusted in environments that do not stay within tidy boundaries. One that does not has a fragility that controlled evaluation will never reveal.
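One way to make this priority concrete is to report the worst-case domain and the spread across domains rather than the best single score. The sketch below illustrates that summary; the domain names and scores are invented for illustration and are not results from this research.

```python
# Hedged sketch: summarising cross-domain stability by worst-case
# score and spread, rather than peak performance. The scores and
# domain names below are illustrative assumptions, not reported results.
from statistics import pstdev

scores = {"imaging": 0.83, "structured": 0.86, "clinical_text": 0.81}

worst = min(scores.values())          # the domain where the system is weakest
spread = pstdev(scores.values())      # how much behaviour shifts between domains

print(round(worst, 2), round(spread, 3))  # 0.81 0.021
```

A system optimised for stability would be tuned to raise the worst-case score and shrink the spread, even at some cost to the best single-domain number.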

Evaluation Methodology

Where real patient data cannot be used, which in this domain is almost always the case, synthetic data and procured datasets with known ground truth provide the evaluation surface. This is not a compromise. It is the honest methodology. Synthetic data with known properties allows the evaluation to be reproducible, the results to be independently verified, and the system's behaviour under specific conditions to be deliberately tested rather than accidentally encountered. Procured datasets with known numbers and types of conditions allow precision and recall to be measured against a ground truth that does not depend on incomplete or ambiguous clinical labelling.
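Scoring against a constructed cohort is straightforward precisely because the labels are known by design. A minimal sketch, with an invented cohort and invented system output purely for illustration:

```python
# Hedged sketch: scoring a system's flagged patients against a synthetic
# cohort whose condition labels are known by construction.
# All patient IDs below are illustrative, not real data.

def precision_recall(flagged: set, true_cases: set) -> tuple:
    """Precision and recall of a set of flagged IDs against ground truth."""
    tp = len(flagged & true_cases)                     # correctly flagged
    precision = tp / len(flagged) if flagged else 0.0  # avoid division by zero
    recall = tp / len(true_cases) if true_cases else 0.0
    return precision, recall

# Synthetic cohort of 100 patients, 5 of whom carry the condition by design.
true_cases = {3, 17, 42, 58, 91}
flagged = {3, 17, 42, 58, 70, 85}   # system output: 4 hits, 2 false alarms

p, r = precision_recall(flagged, true_cases)
print(p, r)  # ≈ 0.667 and 0.8: one true case (patient 91) was missed
```

Because the ground truth is fixed by construction, the same evaluation can be re-run by anyone, which is what makes the methodology reproducible.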


Fine-Tuning and Retrieval as Complements

The clinical records application brings together two approaches that are often presented as alternatives. They are not. For the problem of identifying patients with rare conditions across years of heterogeneous electronic records, fine-tuning and retrieval-augmented generation address different parts of the challenge. And the boundary cases that neither handles alone are precisely the patients most at risk of being missed.

Fine-tuning embeds the domain knowledge: the clinical language, the diagnostic patterns, the rare presentation signatures that appear infrequently enough in training data to be invisible to a general-purpose model. A fine-tuned model does not need to retrieve what a rare condition looks like in clinical notes. It has learned it. That knowledge is stable, consistent, and present at every inference.

Retrieval handles the breadth: the current record, the recent notes, the structured data that changes with every encounter. It surfaces the specific evidence relevant to a specific patient at the moment of query, without requiring that evidence to have been present during training.

Together they reach the boundary cases. The patient whose presentation is suggestive but not definitive, whose notes are incomplete, whose test values sit at the edge of the diagnostic threshold. Either system alone would likely exclude them. The combination, reasoning across embedded knowledge and retrieved evidence simultaneously, can include them. In rare disease identification, the boundary cases are not marginal. They are the patients the system exists to find.
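The inclusion logic described above can be sketched as a decision rule. Everything here is a hypothetical illustration: the function name, the score sources, and the thresholds are assumptions, not the system's actual implementation.

```python
# Hypothetical decision rule combining a fine-tuned model's score with a
# retrieval-evidence score. Thresholds and names are illustrative only.

def flag_for_review(model_score: float, evidence_score: float,
                    hard: float = 0.8, soft: float = 0.5) -> bool:
    """Flag if either signal is strong, or if both are moderately suggestive."""
    if model_score >= hard or evidence_score >= hard:
        return True
    # Boundary case: neither signal alone clears the hard threshold,
    # but agreement between the two tips the balance toward inclusion.
    return model_score >= soft and evidence_score >= soft

print(flag_for_review(0.9, 0.2))   # strong model signal alone -> True
print(flag_for_review(0.6, 0.55))  # boundary case, both moderate -> True
print(flag_for_review(0.6, 0.2))   # single weak signal -> False
```

The middle case is the point: a patient neither component would flag alone is included because the embedded knowledge and the retrieved evidence agree.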


The Metric That Matters

This is the point that separates a clinically serious AI system from one that is merely technically competent, and it deserves to be stated plainly.

Standard AI evaluation optimises for accuracy, the proportion of predictions the system gets right. In most domains, this is a reasonable objective. In clinical screening for serious conditions, it is the wrong one. Accuracy treats every error as equally costly. In medicine, errors are not equally costly. They are not even close.

Consider the two ways a screening system can be wrong. A false positive (FP) tells a healthy patient that they may have a condition they do not have. This causes anxiety, prompts further investigation, and ultimately resolves correctly when a follow-up examination finds nothing. The patient is distressed. They are not harmed.

A false negative (FN) tells a patient with a condition that they are clear. They leave the system. The condition progresses. In rare genetic disorders, the window for intervention may be narrow. The consequences of a missed diagnosis are not a correctable inconvenience. They can be irreversible.

The F-beta metric is the technical expression of this clinical judgement. It is a measure of a system's performance that allows the designer to explicitly weight recall (the system's ability to find every true case) against precision (its ability to avoid false alarms). A beta value greater than one weights recall more heavily. It tells the system, in mathematical terms, that missing a patient who has the condition is worse than flagging a patient who does not.
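The weighting is visible in the arithmetic. The sketch below computes F-beta from confusion-matrix counts; the counts themselves are invented to show the effect of beta, not drawn from any evaluation in this research.

```python
# Minimal sketch of the F-beta trade-off. The function implements the
# standard F-beta formula; the counts below are illustrative assumptions.

def f_beta(tp: int, fp: int, fn: int, beta: float) -> float:
    """F-beta from confusion-matrix counts; beta > 1 weights recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# A screening run that finds 8 of 10 true cases (2 false negatives)
# at the cost of 20 false alarms:
tp, fp, fn = 8, 20, 2

print(round(f_beta(tp, fp, fn, beta=1.0), 3))  # 0.421: balanced weighting
print(round(f_beta(tp, fp, fn, beta=2.0), 3))  # 0.588: recall-weighted
```

With beta = 2, the same system scores markedly higher, because the metric forgives false alarms far more readily than missed cases. Choosing beta is exactly the clinical judgement the surrounding text describes.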

This is not a technical preference. It is a clinical decision, made explicitly, auditably, and adjustably by the people responsible for the system's behaviour. It can be reviewed by clinical governance. It can be changed if the clinical context changes. It can be reported to regulators as a documented design choice with a documented rationale.

A system that has never had this conversation, that ships with a default accuracy metric and calls it evaluated, has not been designed for clinical use. It has been adapted to it, which is a different thing entirely.


The Potential

Electronic medical records now exist at scale across most healthcare systems in the developed world. Within them, buried in years of clinical notes, test results, and referral letters, are patients with conditions that have not yet been identified. Not because the evidence is absent, but because no human reviewer has had the time or the systematic method to find it.

Retrospective search across these records, finding patients who match the profile of a rare condition from historical data, is one of the most significant untapped applications of clinical AI. It does not require real-time inference. It requires a system that can read across heterogeneous data types at scale, reason about ambiguous and incomplete evidence, and flag candidates for clinical review with enough sensitivity to include the boundary cases and enough transparency to explain why each candidate was selected.

Real-time flagging, identifying candidates as new records are created rather than trawling historical data, extends this further. A patient who presents today with a pattern consistent with a rare genetic disorder can be flagged for specialist review at the point of encounter, before years of records accumulate that a retrospective search would later have to reconstruct.

The downstream imaging examination, where flagged candidates proceed to targeted imaging review, connects directly to the production realities described in the Medical Imaging primer: the cross-device challenges, the transformation layer architecture, and the governance frameworks that make AI-assisted clinical review trustworthy rather than merely functional.

The research foundation for this work is experimental and ongoing. The potential it points toward is neither speculative nor distant. The components exist. The methodology is established. The clinical reasoning is sound. What remains is the engineering discipline and governance rigour to bring them together in production, which is, as always, where the real work is done.