LoRA with LLaMA: Fine-Tuning as Embedded Knowledge
Retrieval-augmented generation has become the default approach for giving language models access to organisational knowledge. It is fast to deploy, relatively straightforward to explain, and works well when the right content can reliably be found at inference time. For many use cases, it is the right choice.
For some, it is not. And understanding the difference is one of the more consequential architectural decisions an organisation can make when deploying AI on proprietary data.
The Problem RAG Cannot Solve
RAG works by retrieving relevant content from a document store at the moment a query arrives, then presenting that content to the model as context. The model reasons over what it finds. It does not own the knowledge; it borrows it, briefly, for each inference.
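The retrieve-then-reason loop is simple enough to sketch directly. The corpus, lexical scoring function, and prompt template below are illustrative stand-ins, not the components of any particular RAG stack; production systems typically score with dense embeddings rather than word overlap.

```python
def score(query: str, doc: str) -> int:
    # Toy lexical overlap, stripping trailing punctuation from tokens.
    # Real retrievers use dense embeddings; this is only an illustration.
    q_terms = {t.strip(".,?!") for t in query.lower().split()}
    d_terms = {t.strip(".,?!") for t in doc.lower().split()}
    return len(q_terms & d_terms)

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    # Rank the store by relevance and borrow the top-k documents.
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, corpus: list[str]) -> str:
    # Present the borrowed content to the model as inference-time context.
    context = "\n".join(retrieve(query, corpus))
    return f"Context:\n{context}\n\nQuestion: {query}"

corpus = [
    "The refund policy allows returns within 30 days.",
    "Offices are closed on public holidays.",
]
prompt = build_prompt("What is the refund policy?", corpus)
```

Every failure mode discussed below lives inside `retrieve`: if the right document is missing, partial, or outranked by competing content, the model reasons over the wrong context.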
This creates a class of failure modes that are easy to overlook during evaluation and consequential in production. Retrieval can fail. The right document may not be found, or may be found partially, or may be retrieved alongside competing content that confuses the reasoning. In regulated environments, the provenance of every inference matters, and a system whose answers depend on what it happened to retrieve on a given query is harder to audit than one whose knowledge is stable, embedded, and consistent.
There is a deeper issue for organisations where data sovereignty is non-negotiable. Medical records, proprietary financial models, classified research, competitive intelligence. Data that cannot leave the building cannot be sent to an external model at inference time, even as retrieval context. The architecture that works for publicly available knowledge breaks entirely for knowledge that must stay inside the organisation's own infrastructure.
Fine-tuning addresses this differently. Rather than retrieving knowledge at inference time, it embeds knowledge into the model's weights during training. The model does not look up what it knows. It knows it. The knowledge is stable, consistent, testable, and entirely contained within a model that the organisation owns and operates on its own terms.
The Experiment
Fine-tuning a large language model from scratch is computationally prohibitive for most organisations. Low-Rank Adaptation (LoRA) provides a practical path. Rather than updating all of a model's parameters during training, LoRA introduces small, trainable matrices at key points in the architecture. The base model remains frozen. The adaptation is lightweight, targeted, and reversible. The result is a model that behaves as if it has been fine-tuned on the new data, at a fraction of the computational cost of full retraining.
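The mechanism is compact enough to show in full. LoRA leaves a frozen weight matrix W untouched and adds a trainable low-rank product alongside it, so the forward pass becomes h = Wx + (α/r)·BAx, where A and B have rank r far smaller than the dimensions of W. The NumPy sketch below uses illustrative sizes; a real run would apply this through a library such as PEFT to the attention projections of the actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 16, 32, 4, 8   # illustrative sizes; r << d_in, d_out

W = rng.standard_normal((d_out, d_in))      # frozen base weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init

def lora_forward(x):
    # Base path plus scaled low-rank correction: W x + (alpha / r) * B (A x).
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)

# With B zero-initialised the adapter is a no-op, so training starts from
# exactly the base model's behaviour and adapts away from it.
assert np.allclose(lora_forward(x), W @ x)

# Trainable parameters: r * (d_in + d_out) for LoRA versus d_in * d_out
# for full fine-tuning of this one matrix.
lora_params = r * (d_in + d_out)
full_params = d_in * d_out
```

Reversibility falls out of the same structure: dropping (or merging) B·A recovers (or bakes in) the adaptation without ever having modified W.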
The experiment applies LoRA to LLaMA-3 8B, using the FreshQA dataset as the evaluation benchmark. FreshQA is a question-answering benchmark specifically designed to test knowledge that changes over time, making it a demanding evaluation surface for any fine-tuning approach. A model that memorises rather than learns will perform poorly here. One that has genuinely adapted to new knowledge will not.
Infrastructure and Scale
The experiment runs on eight H200 GPUs. Serious compute by any measure, and deliberately so. One of the practical questions organisations face when considering fine-tuning is whether the infrastructure demands are tractable. This setup is designed to be representative of what a committed production fine-tuning pipeline looks like: distributed training across multiple high-memory GPUs, careful management of batch sizes and gradient accumulation, and the engineering discipline required to keep a multi-GPU run stable across the full training cycle.
Distributed training introduces coordination challenges that single-GPU experiments do not. Gradient synchronisation, memory management across devices, and the debugging of failures that only manifest at scale all require specific competencies. The experiment is designed to encounter and resolve these challenges, not avoid them.
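The arithmetic behind gradient synchronisation is worth making concrete. In data-parallel training, each device accumulates gradients over its own micro-batches, and an all-reduce then averages those gradients across devices; done correctly, the result is identical to the gradient of a single device processing the full global batch. The single-process NumPy simulation below illustrates that equivalence with a toy per-example gradient; it is not a distributed launch, and frameworks such as PyTorch DDP handle the actual synchronisation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_devices, micro_batches, dim = 4, 2, 8

# Toy per-example data; for the loss 0.5 * ||w - x||^2 at w = 0,
# the per-example gradient is simply -x.
data = rng.standard_normal((n_devices, micro_batches, dim))

# Step 1: each device accumulates (averages) gradients over its micro-batches.
per_device = -data.mean(axis=1)

# Step 2: the all-reduce averages the per-device gradients across devices.
synced = per_device.mean(axis=0)

# Reference: a single device computing the gradient over the full global batch.
reference = -data.reshape(-1, dim).mean(axis=0)

assert np.allclose(synced, reference)
```

The failure modes that only manifest at scale, such as mismatched accumulation steps or uneven batch sizes, are precisely the cases where this equivalence silently breaks.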
ECO-Guided Optimisation
LoRA introduces its own hyperparameters: the rank of the adaptation matrices, the scaling factor, which layers to adapt, and how aggressively to regularise. These choices materially affect the quality of the fine-tuned model, and the right values are not obvious in advance. This is precisely the setting where ECO's constructive optimisation approach is most valuable.
Rather than relying on default values or manual tuning, ECO guides the search for effective LoRA configurations, constructing its search landscape from empirical feedback as the experiment progresses. The hyperparameter selection that in many fine-tuning projects falls to human intuition and trial-and-error is handled systematically, transparently, and without embedding the researcher's assumptions about where good configurations are likely to live.
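ECO's constructive optimisation itself is proprietary and not reproduced here. As a stand-in, the sketch below runs a simple greedy coordinate search over a hypothetical LoRA hyperparameter space against a toy objective, just to make the empirical feedback loop concrete: adjust one hyperparameter at a time, keep any change the evaluation rewards, repeat until nothing improves. The space, the assumed optimum, and the objective are all illustrative.

```python
# Hypothetical LoRA search space; values are illustrative, not recommendations.
SPACE = {
    "rank": [4, 8, 16, 32],
    "alpha": [8, 16, 32],
    "dropout": [0.0, 0.05, 0.1],
}
TARGET = {"rank": 16, "alpha": 32, "dropout": 0.05}  # pretend optimum

def evaluate(cfg):
    # Stand-in for "fine-tune with cfg, then score on held-out questions".
    return sum(cfg[k] == TARGET[k] for k in cfg)

def coordinate_search(start):
    # Greedy feedback loop: try each value for one hyperparameter at a time,
    # keeping any change that improves the empirical score, until none does.
    cfg, improved = dict(start), True
    while improved:
        improved = False
        for key, values in SPACE.items():
            for v in values:
                trial = {**cfg, key: v}
                if evaluate(trial) > evaluate(cfg):
                    cfg, improved = trial, True
    return cfg

best = coordinate_search({"rank": 4, "alpha": 8, "dropout": 0.0})
```

The point of the sketch is the shape of the loop, not the search strategy: every configuration choice is justified by measured feedback rather than by a researcher's prior about where good values live.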
Evaluation
Performance is assessed using deterministic metrics on held-out FreshQA data, i.e. questions the model has not encountered during training. This is the only honest measure of whether fine-tuning has produced genuine knowledge adaptation or surface-level pattern matching. The evaluation methodology is designed to be reproducible and resistant to the kinds of optimistic reporting that can emerge when evaluation data bleeds into training.
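Deterministic QA metrics of this kind are typically exact match and token-level F1, computed after light answer normalisation. The implementation below is a generic version of those two metrics, shown for concreteness; it is not FreshQA's official scorer, and the example strings are hypothetical.

```python
import re
import string

def normalise(text: str) -> str:
    # Lowercase, drop punctuation and English articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> bool:
    return normalise(pred) == normalise(gold)

def f1(pred: str, gold: str) -> float:
    # Token-level F1 over the normalised prediction and reference.
    p, g = normalise(pred).split(), normalise(gold).split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)
```

Because both metrics are pure functions of the prediction and the reference, the same held-out split scored twice yields the same numbers, which is what makes the evaluation reproducible and auditable.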
What It Means for Organisations
The experiment is research. The implications are operational.
For organisations with proprietary knowledge that cannot be externalised, fine-tuning on sovereign infrastructure is not an exotic option. It is often the only architecturally sound one. A model fine-tuned on that data, running on infrastructure the organisation controls, answering queries without any external dependency at inference time, is a qualitatively different capability from a retrieval system that hopes to find the right document at the right moment.
The knowledge is embedded. It can be tested against held-out data before deployment. It can be versioned with different fine-tuned models for different knowledge states, with clear lineage between them. It can be audited, because the training data and the evaluation results are both known quantities. And it can be updated through further fine-tuning as the organisation's knowledge evolves, without rebuilding the retrieval infrastructure that a RAG system depends on.
Fine-tuning is not the right choice for every use case. It requires more upfront investment than retrieval, and it is less suited to knowledge that changes continuously. But for the organisations and use cases where it fits, and there are more of them than the current industry default would suggest, it produces something that retrieval cannot: a model that genuinely knows what the organisation knows, on the organisation's own terms.
The infrastructure challenge is real but tractable. The engineering discipline required is established. The path from experiment to production is well understood. This is advanced work. It is not inaccessible work.