Retrieval-Augmented Fine-Tuning (RAFT): Merging RAG and SLM Training On-Premises
Explore how Retrieval-Augmented Fine-Tuning combines the strengths of RAG and fine-tuning to produce highly accurate, domain-specific small language models in on-premises environments.
The Gap Between RAG and Fine-Tuning
Retrieval-Augmented Generation (RAG) and fine-tuning have emerged as the two dominant strategies for adapting language models to enterprise-specific knowledge. RAG excels at keeping models current by pulling relevant documents at inference time, while fine-tuning embeds domain knowledge directly into model weights for faster, more consistent responses. Yet each approach carries limitations that become painfully apparent in production on-premises deployments.
RAG systems can struggle with retrieval quality — when the retriever surfaces irrelevant or partially relevant documents, the generator produces plausible but incorrect answers. Fine-tuned models, on the other hand, can hallucinate confidently on topics outside their training distribution and require expensive retraining cycles when the underlying knowledge base changes. Retrieval-Augmented Fine-Tuning (RAFT) addresses both shortcomings by training the model to reason over retrieved documents, effectively teaching it when to trust retrieval context and when to rely on its own parameters.
How RAFT Works: Training Models to Reason Over Context
The core insight behind RAFT is straightforward: instead of fine-tuning a model on clean question-answer pairs, you train it on question-context-answer triples where the context includes both relevant ("oracle") documents and deliberately irrelevant ("distractor") documents. During training, the model learns to identify which retrieved passages are actually useful and to extract the correct answer while ignoring noise.
A typical RAFT training pipeline on-premises follows this structure:
1. Dataset construction: For each training example, pair the question with one or more oracle documents that contain the answer, plus several distractor documents sampled from the same corpus. The ratio of oracle to distractor documents should reflect real-world retrieval noise — a 1:4 ratio is a reasonable starting point.
2. Chain-of-thought annotation: Augment each training example with a reasoning trace that shows how the model should identify the relevant passage and extract the answer. This step is critical for teaching the model to cite its sources rather than hallucinate.
3. Mixed training strategy: Include a fraction of examples (typically 10-20%) where no oracle document is present, forcing the model to recognize when retrieved context is insufficient and fall back to its parametric knowledge.
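Steps 1 and 3 above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name, the field names, and the defaults (four distractors, a 20% chance of withholding the oracle) are assumptions chosen to match the numbers quoted above, and documents are represented as plain strings.

```python
import random

def build_raft_example(question, answer, oracle_docs, corpus, rng,
                       num_distractors=4, oracle_drop_prob=0.2):
    """Assemble one question-context-answer training example.

    All names and defaults are illustrative assumptions.
    """
    oracle_set = set(oracle_docs)
    # Distractors are drawn from the same corpus, excluding the oracle documents.
    candidates = [doc for doc in corpus if doc not in oracle_set]
    distractors = rng.sample(candidates, min(num_distractors, len(candidates)))

    # With probability oracle_drop_prob, withhold the oracle documents so the
    # model also sees cases where the retrieved context is insufficient.
    has_oracle = rng.random() >= oracle_drop_prob
    context = list(distractors) + (list(oracle_docs) if has_oracle else [])
    rng.shuffle(context)  # avoid teaching the model a positional shortcut

    return {"question": question, "context": context,
            "answer": answer, "has_oracle": has_oracle}
```

Passing in a seeded random.Random instance keeps dataset construction reproducible across runs, which makes it easier to compare training configurations later.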
On-Premises Infrastructure Requirements
Running RAFT on-premises is feasible with modest hardware compared to full pre-training, since you are fine-tuning an existing small language model. A single node with 2-4 GPUs (NVIDIA A100 or H100) can handle RAFT training for models in the 1B-8B parameter range. The primary infrastructure considerations are data pipeline throughput and experiment tracking.
Your data pipeline needs to handle three concurrent streams: the base training corpus, the retriever index for generating context windows, and the chain-of-thought annotations. If your organization already runs an on-premises vector database for RAG (such as Milvus, Qdrant, or Weaviate), you can reuse that infrastructure for distractor document sampling during dataset construction.
Experiment tracking becomes especially important with RAFT because you are tuning multiple interdependent variables: the distractor ratio, the chain-of-thought format, the oracle document selection strategy, and the fine-tuning learning rate. Self-hosted tools like MLflow or Weights & Biases help you systematically compare runs and avoid regressions when iterating on training configurations.
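One way to keep those variables comparable is to describe every run with a single typed config and enumerate the combinations up front. The sketch below uses only the standard library; the class name, the field names, and every value in the grid are hypothetical placeholders, not recommended settings.

```python
from dataclasses import dataclass, asdict
from itertools import product

@dataclass(frozen=True)
class RaftRunConfig:
    # Field names are illustrative assumptions, one per variable named above.
    distractors_per_oracle: int
    cot_format: str
    oracle_strategy: str
    learning_rate: float

def enumerate_runs(grid):
    """Yield one config per combination of the values in `grid`."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield RaftRunConfig(**dict(zip(keys, values)))

# Hypothetical search grid; replace the values with your own candidates.
grid = {
    "distractors_per_oracle": [2, 4],
    "cot_format": ["quote_then_answer", "free_form"],
    "oracle_strategy": ["single", "multi"],
    "learning_rate": [1e-5, 2e-5],
}
runs = list(enumerate_runs(grid))
```

Each config converts to a flat dictionary with asdict(run), which is the shape a tracker expects for run parameters (for example, MLflow's log_params accepts a dictionary).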
Storage requirements are moderate. The RAFT training dataset is typically 3-5x larger than a standard fine-tuning dataset due to the included context documents, but this rarely exceeds a few hundred gigabytes even for large enterprise corpora.
Practical Implementation Patterns
The most successful RAFT deployments we have observed follow an iterative refinement loop rather than a one-shot training approach. Start with a baseline fine-tuned model and a baseline RAG system, then use the RAG system's failure cases to construct targeted RAFT training examples.
Pattern 1: Error-driven dataset construction. Log every query where your existing RAG system produces an incorrect or low-confidence answer. Pair each of these queries with the documents that were actually retrieved (which likely included noise that confused the model) and the correct answer. This creates training examples that directly address your system's weakest points.
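As a rough sketch of Pattern 1, assume each log entry is a dictionary with query, answer, confidence, and retrieved_docs keys, and that verified answers live in a separate lookup. Those field names, the 0.5 confidence floor, and the case-insensitive string comparison are all assumptions; a production system would likely use a more forgiving correctness check.

```python
def failures_to_raft_examples(query_log, gold_answers, confidence_floor=0.5):
    """Turn logged RAG failures into RAFT training examples.

    query_log: iterable of dicts with "query", "answer", "confidence",
               and "retrieved_docs" keys (hypothetical log schema).
    gold_answers: dict mapping query text to the verified correct answer.
    """
    examples = []
    for entry in query_log:
        gold = gold_answers.get(entry["query"])
        if gold is None:
            continue  # no verified answer yet, so no training example
        wrong = entry["answer"].strip().lower() != gold.strip().lower()
        unsure = entry["confidence"] < confidence_floor
        if wrong or unsure:
            examples.append({
                "question": entry["query"],
                # Keep the noisy context exactly as the retriever returned it.
                "context": list(entry["retrieved_docs"]),
                "answer": gold,
            })
    return examples
```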
Pattern 2: Domain-stratified distractor sampling. Rather than sampling distractors uniformly from your corpus, sample them from the same domain or document category as the oracle document. This produces harder training examples and results in a model that is better at distinguishing between topically similar but factually different passages — a common failure mode in enterprise knowledge bases where multiple versions of similar policies or procedures exist.
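A minimal sketch of Pattern 2 follows, assuming every document carries a category label in its metadata. The metadata layout and the fallback to other categories when the oracle's category is too small are illustrative choices, not part of any fixed RAFT recipe.

```python
import random

def sample_stratified_distractors(oracle_id, docs, k, rng):
    """Pick k distractors, preferring the oracle document's own category.

    docs: dict mapping document id to a dict with a "category" key
          (hypothetical metadata layout).
    """
    category = docs[oracle_id]["category"]
    same = [d for d in docs if d != oracle_id and docs[d]["category"] == category]
    other = [d for d in docs if docs[d]["category"] != category]

    picked = rng.sample(same, min(k, len(same)))
    # Top up from other categories only if the oracle's category is too small.
    shortfall = k - len(picked)
    if shortfall > 0:
        picked += rng.sample(other, min(shortfall, len(other)))
    return picked
```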
Pattern 3: Progressive context window expansion. Begin training with short context windows (2-3 documents) and gradually increase to 5-8 documents as the model's reasoning improves. This curriculum-style approach tends to produce more stable training and better final accuracy than starting with large context windows immediately.
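The schedule in Pattern 3 can be as simple as a function from training step to context size. The linear ramp below is one assumption among several reasonable shapes (a stepwise schedule would work too), and the 2 and 8 defaults are taken from the ranges mentioned above.

```python
def docs_per_example(step, total_steps, start_docs=2, end_docs=8):
    """Linearly ramp the number of context documents over training."""
    if total_steps <= 1:
        return end_docs
    # Clamp the step so out-of-range values still map onto the ramp.
    clamped = min(max(step, 0), total_steps - 1)
    progress = clamped / (total_steps - 1)
    return start_docs + round(progress * (end_docs - start_docs))
```

The data loader would call this once per batch and truncate or pad each example's context list to the returned size.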
Measuring RAFT Effectiveness
Standard language model evaluation metrics are insufficient for assessing RAFT-trained models. You need metrics that capture both answer accuracy and retrieval reasoning quality. A practical evaluation framework includes three dimensions:
Answer correctness: Measure exact match and semantic similarity against ground-truth answers on a held-out test set that includes both oracle and distractor documents. Compare against your baseline RAG system and your baseline fine-tuned model to quantify the improvement.
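For the exact-match half of this metric, a normalized comparison is usually more useful than raw string equality. The sketch below follows the common convention of lowercasing and stripping punctuation and articles; the token-level F1 is only a lexical proxy, and true semantic similarity would need an embedding model, which is not shown here.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    return normalize(prediction) == normalize(truth)

def token_f1(prediction, truth):
    """Token-overlap F1: a lexical stand-in, not semantic similarity."""
    pred, gold = normalize(prediction).split(), normalize(truth).split()
    if not pred or not gold:
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```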
Attribution accuracy: Verify that the model correctly identifies which retrieved document(s) support its answer. If your RAFT training includes chain-of-thought annotations with citations, you can automatically check whether the model references the oracle document rather than a distractor.
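The automatic check described here depends on a machine-readable citation format. The sketch below assumes the model was trained to cite sources as [doc:ID]; that tag format is hypothetical, so adjust the pattern to whatever your chain-of-thought annotations actually use. An output counts as correct only if it cites at least one document and every cited document is an oracle.

```python
import re

# Hypothetical citation tag, e.g. "[doc:policy-12]".
CITATION = re.compile(r"\[doc:([A-Za-z0-9_-]+)\]")

def attribution_accuracy(outputs, oracle_ids):
    """Fraction of outputs whose citations all point at oracle documents.

    outputs: list of model output strings.
    oracle_ids: list of oracle id lists, aligned with outputs.
    """
    if not outputs:
        return 0.0
    correct = 0
    for text, oracles in zip(outputs, oracle_ids):
        cited = set(CITATION.findall(text))
        if cited and cited <= set(oracles):
            correct += 1
    return correct / len(outputs)
```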
Robustness under noise: Evaluate model performance as you increase the distractor-to-oracle ratio in the test set. A well-trained RAFT model should degrade gracefully as noise increases, rather than falling off a cliff at a particular threshold. This metric is a direct indicator of how the model will perform in production when retrieval quality varies.
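A noise sweep like this is straightforward to script. In the sketch below, answer_fn stands in for your model call (a hypothetical callable that takes a question and a list of context documents and returns an answer string), each test example is assumed to hold a single oracle document, and correctness is plain string equality for brevity.

```python
import random

def robustness_curve(answer_fn, test_set, distractor_pool, distractor_counts, seed=0):
    """Accuracy at each distractor count, with one oracle document per example."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    rng = random.Random(seed)
    curve = {}
    for n in distractor_counts:
        hits = 0
        for ex in test_set:
            context = [ex["oracle"]] + rng.sample(distractor_pool, n)
            rng.shuffle(context)
            hits += answer_fn(ex["question"], context) == ex["answer"]
        curve[n] = hits / len(test_set)
    return curve

def largest_drop(curve):
    """Biggest accuracy loss between adjacent noise levels."""
    levels = sorted(curve)
    return max((curve[a] - curve[b] for a, b in zip(levels, levels[1:])), default=0.0)
```

A large value from largest_drop at one particular step is the "cliff" to watch for, while a curve that declines by small, similar amounts at each level indicates graceful degradation.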
In our experience, RAFT-trained SLMs in the 3B-7B range consistently outperform both standalone RAG and standalone fine-tuning on domain-specific question-answering tasks, particularly in scenarios where the knowledge base contains dense, overlapping information — exactly the kind of content found in enterprise documentation, regulatory texts, and technical manuals.
When to Choose RAFT Over Standalone Approaches
RAFT is not a universal replacement for RAG or fine-tuning. It is most valuable when your use case exhibits specific characteristics: a stable but complex knowledge base, high accuracy requirements, and frequent retrieval noise. If your documents change daily and accuracy requirements are moderate, a well-tuned RAG pipeline may be sufficient. If your domain is narrow and static, pure fine-tuning may be simpler to maintain.
The sweet spot for RAFT on-premises is enterprise environments where the knowledge base updates on a weekly or monthly cadence, where incorrect answers carry real business or compliance costs, and where the document corpus contains enough topical overlap to challenge standard retrieval approaches. In these scenarios, the additional training complexity of RAFT pays for itself through measurably higher answer accuracy and reduced hallucination rates — outcomes that matter when your AI system supports regulated processes or high-stakes decision making.
Featured image by Albert Stoynov on Unsplash.