Blog
Fine-Tuning Small Language Models with Domain-Specific Data On-Premises
A practical guide to fine-tuning small language models using proprietary domain data entirely on-premises, covering data preparation, training infrastructure, and evaluation strategies.
Why Fine-Tuning Beats Prompting for Domain Tasks
Small language models (SLMs) in the 1-7 billion parameter range have become remarkably capable at general tasks. Models like Mistral 7B, Phi-3, and Llama 3.2 deliver strong performance on standard benchmarks while running comfortably on modest hardware. But general capability and domain expertise are different things. When your use case involves specialized terminology, proprietary formats, or industry-specific reasoning patterns, a general-purpose SLM will struggle regardless of how carefully you craft your prompts.
Fine-tuning adapts the model's weights to your domain, teaching it patterns that prompt engineering alone cannot convey. A fine-tuned SLM can learn your organization's document conventions, technical vocabulary, classification schemes, and domain-specific logic. The result is a model that produces domain-appropriate outputs consistently, without requiring elaborate prompting strategies that consume context window space and add latency.
Doing this on-premises means your proprietary training data — the competitive advantage encoded in your documents, processes, and decisions — never leaves your infrastructure. For organizations in regulated industries or with strict intellectual property policies, this is not optional; it is a prerequisite.
Preparing Domain Data for Fine-Tuning
The quality of your fine-tuned model depends entirely on the quality of your training data. Garbage in, garbage out applies with particular force to fine-tuning, where a small dataset has outsized influence on the model's behavior.
Data collection starts with identifying the specific task you want the model to perform. "Understanding our domain" is too vague — "classifying customer support tickets into 15 categories" or "generating structured summaries from engineering reports" gives you a concrete target. The task definition determines what your training examples look like: input-output pairs where the input matches what the model will see in production and the output matches what you want it to produce.
Data formatting for instruction fine-tuning typically follows a chat template with system, user, and assistant roles. Structure each example as a conversation turn where the user message contains the input and the assistant message contains the desired output. Keep the system message consistent across examples to establish the model's persona and constraints. Most fine-tuning frameworks expect data in JSONL format with a "messages" array following the OpenAI chat format, which has become a de facto standard.
Data quality checks are non-negotiable. Review a random sample of at least 100 examples manually. Look for inconsistencies in formatting, contradictions between examples, examples where the output does not actually follow from the input, and label noise. Remove or correct problematic examples. A clean dataset of 500 high-quality examples often outperforms a noisy dataset of 5,000. For classification tasks, verify that your class distribution in the training data roughly matches what the model will encounter in production, or apply appropriate sampling strategies to handle imbalance.
Data augmentation can expand a limited dataset without collecting new examples. Techniques include paraphrasing inputs while keeping outputs fixed, creating variations of existing examples by modifying non-essential details, and using a larger model to generate candidate examples that you then verify manually. Be cautious with synthetic data — always review generated examples before including them in your training set.
Infrastructure Requirements for On-Premises Fine-Tuning
Fine-tuning an SLM on-premises is significantly less demanding than pre-training, but it still requires thoughtful infrastructure planning. The hardware requirements depend on the model size, training approach, and dataset scale.
For full fine-tuning of a 7B parameter model, you need at least one GPU with 40GB or more of VRAM — an NVIDIA A100 40GB or A6000 48GB handles this comfortably. Full fine-tuning updates all model parameters, which requires holding the model weights, optimizer states, and gradients in GPU memory simultaneously.
For most practical purposes, LoRA (Low-Rank Adaptation) or QLoRA dramatically reduce hardware requirements. LoRA freezes the original model weights and trains small adapter matrices that modify the model's behavior. QLoRA goes further by quantizing the base model to 4-bit precision, reducing memory requirements enough to fine-tune a 7B model on a single GPU with 16GB of VRAM — an NVIDIA RTX 4090 or T4 is sufficient. The quality trade-off is minimal for most domain adaptation tasks.
On the software side, the Hugging Face Transformers library combined with PEFT (Parameter-Efficient Fine-Tuning) provides the most mature and well-documented fine-tuning stack. For training orchestration, Axolotl wraps these libraries into a configuration-driven workflow that handles data loading, training configuration, and checkpoint management. All these tools run entirely on-premises without any cloud dependencies.
Plan for storage beyond GPU compute. Model checkpoints accumulate during training — a 7B model checkpoint is approximately 14GB at fp16 precision, and you will want to keep multiple checkpoints for comparison. Budget at least 200GB of fast SSD storage for a single fine-tuning run, more if you are iterating across multiple configurations.
The Fine-Tuning Process Step by Step
Step 1: Establish a baseline. Before fine-tuning, evaluate the base model on your task using a held-out test set. This gives you a concrete performance number to improve upon. Run the base model on 50-100 representative examples and score the outputs using your task-specific metrics. For classification, measure accuracy and per-class F1. For generation tasks, use a combination of automated metrics (ROUGE, BERTScore) and human evaluation of a sample.
Step 2: Configure the training run. Set your LoRA rank (r=16 is a solid starting point for domain adaptation), target the attention layers (q_proj, v_proj at minimum; adding k_proj and o_proj can improve results for complex tasks), set learning rate to 2e-4 with a cosine schedule, and train for 3-5 epochs. Batch size depends on your GPU memory — start with 4 and adjust. Enable gradient checkpointing if memory is tight.
Step 3: Monitor training. Track training loss, validation loss, and your task-specific metrics at each evaluation step. Watch for overfitting — if training loss continues to decrease while validation loss plateaus or increases, you are memorizing the training data rather than learning generalizable patterns. With small datasets, overfitting can happen within a single epoch. Use Weights & Biases (self-hosted) or MLflow for experiment tracking.
Step 4: Evaluate the fine-tuned model. Run the fine-tuned model on the same test set used for the baseline evaluation. Compare metrics directly. Also perform qualitative evaluation — read through 20-30 outputs and assess whether the model has actually learned the domain patterns or just superficial formatting. Test edge cases: inputs that are ambiguous, out of distribution, or adversarial. A model that performs well on average but fails on edge cases may not be production-ready.
Step 5: Iterate or deploy. If performance is insufficient, diagnose the cause before re-running training with different hyperparameters. Common issues include insufficient or low-quality training data (add more examples), overfitting (reduce epochs, increase regularization), and undertrained adapters (increase LoRA rank or add target modules). If the model meets your performance criteria, merge the LoRA adapters into the base model weights and export a single model file for deployment.
Evaluation Strategies That Actually Work
Evaluating fine-tuned language models is harder than evaluating classifiers because the output space is open-ended. A generated summary can be correct in many different ways, and automated metrics only capture some dimensions of quality.
Build a task-specific evaluation suite rather than relying on general benchmarks. If your model classifies documents, create a test set with at least 20 examples per class, including ambiguous cases near class boundaries. If your model generates reports, define specific criteria: does it include all required fields, does it use correct terminology, does it follow the expected structure, is the content factually grounded in the input.
Combine automated and human evaluation. Automated metrics provide fast, reproducible scores suitable for comparing training runs. Human evaluation catches quality dimensions that metrics miss — fluency, appropriateness, factual grounding, and whether the output would actually be useful in practice. Establish a rubric so human evaluators score consistently, and evaluate inter-annotator agreement on a shared sample before trusting the scores.
Test for regression on general capabilities. Fine-tuning can cause catastrophic forgetting, where the model loses general skills while acquiring domain expertise. Run the fine-tuned model on a small set of general-purpose tasks (basic reasoning, instruction following, harmlessness) to verify it still behaves well outside your domain. If you observe significant regression, reduce the learning rate or the number of training epochs.
From Fine-Tuned Model to Production Service
A fine-tuned model is only valuable when it serves predictions reliably. Deploy the merged model using an inference server like vLLM, llama.cpp, or TGI (Text Generation Inference), all of which run on-premises without external dependencies. vLLM offers the best throughput for concurrent requests through continuous batching; llama.cpp excels when running on CPU-only nodes or limited GPU hardware through aggressive quantization.
Implement a versioning strategy from the start. Tag each deployed model with its training dataset version, fine-tuning configuration, and evaluation metrics. Store this metadata in your model registry alongside the model weights. When you retrain — and you will need to retrain as your domain evolves — you need to compare the new version against the current production version on the same evaluation suite before swapping.
Monitor the fine-tuned model's performance in production using the same metrics you used during evaluation. Log inputs, outputs, and any user feedback. Watch for domain drift — the real-world distribution of inputs will shift over time, and your fine-tuning data may become stale. Set up alerts for when production performance metrics fall below your acceptance thresholds, and feed production data back into your training dataset for the next fine-tuning iteration.
Featured image by Sam Moghadam on Unsplash.