RAG Systems · Advanced · 2026-01-30 · 11 min read

Fine-Tuning Embeddings for Domain-Specific RAG: A 20% Recall Jump

Generic embeddings (BGE, OpenAI) leave 20% recall on the table for domain text. Learn how to mine training pairs from your own documents and fine-tune sentence-transformers for medical, legal, or financial RAG.

Why Generic Embeddings Fall Short

OpenAI's text-embedding-3 and BGE are trained on internet text. They're great for general queries but they don't know that "MI" means myocardial infarction, "AKI" means acute kidney injury, or that "Section 230" is a US law about platform liability.

For domain-specific RAG, fine-tuning your own embeddings on your domain corpus can improve retrieval recall by roughly 15–25 percentage points (absolute), consistent with the Recall@5 figures under Typical Improvements.

Strategy: Mine Pairs from Existing Documents

You don't need labeled data. Generate (query, positive_chunk) pairs from your existing corpus:

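```python
import asyncio
import json

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader


# 1. Use a teacher LLM to generate synthetic queries for each chunk.
#    `llm` is your own async LLM client (not defined here); any object with
#    an awaitable `complete(prompt) -> str` method works.
async def generate_queries(chunk: str, n: int = 3) -> list[str]:
    prompt = f"""Generate {n} different questions that this passage answers.
Return as a JSON array of strings.

Passage:
{chunk}
"""
    response = await llm.complete(prompt)
    return json.loads(response)


# 2. Build training pairs. `domain_chunks` is your list of chunk strings.
async def build_pairs(domain_chunks: list[str]) -> list[InputExample]:
    pairs = []
    for chunk in domain_chunks:
        queries = await generate_queries(chunk, n=3)
        for q in queries:
            pairs.append(InputExample(texts=[q, chunk]))
    return pairs


pairs = asyncio.run(build_pairs(domain_chunks))
```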
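Teacher LLMs do not always return clean JSON, so it can help to parse the response defensively before building pairs. The following is a minimal sketch, not part of the original pipeline: `parse_queries` is a hypothetical helper that strips an optional Markdown fence, rejects non-list output, and drops empty or duplicate questions.

```python
import json


def parse_queries(response: str, n: int = 3) -> list[str]:
    """Parse a teacher-LLM response into at most n unique, non-empty questions."""
    fence = "`" * 3
    text = response.strip()
    # Strip a Markdown code fence if the model wrapped its JSON in one.
    if text.startswith(fence):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []  # Caller can retry or skip this chunk.
    if not isinstance(data, list):
        return []
    seen, queries = set(), []
    for item in data:
        if not isinstance(item, str):
            continue
        q = item.strip()
        if q and q.lower() not in seen:
            seen.add(q.lower())
            queries.append(q)
    return queries[:n]
```

Returning an empty list instead of raising lets the pair-building loop skip a bad chunk rather than abort the whole run.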

Mining Hard Negatives

Random negatives are too easy. Mine hard negatives — chunks that look similar but aren't the right answer:

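```python
from sentence_transformers import util

base_model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Each chunk appears once per generated query in `pairs`, so deduplicate first.
# Otherwise the "hard negative" is often another copy of the true positive.
chunks = list(dict.fromkeys(p.texts[1] for p in pairs))
chunk_index = {chunk: idx for idx, chunk in enumerate(chunks)}

# Encode chunks and queries in batches rather than one query at a time.
chunk_embeddings = base_model.encode(chunks, convert_to_tensor=True)
query_embeddings = base_model.encode(
    [p.texts[0] for p in pairs], convert_to_tensor=True
)
similarities = util.cos_sim(query_embeddings, chunk_embeddings)

for i, pair in enumerate(pairs):
    similarities[i, chunk_index[pair.texts[1]]] = -1  # exclude the true positive
    hard_neg_idx = similarities[i].argmax().item()
    pair.texts.append(chunks[hard_neg_idx])
```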
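The selection step itself can be illustrated without any model. This is a small sketch on a toy similarity row: `pick_hard_negative` and its `max_score` cutoff are hypothetical names, and the cutoff is an optional assumption for skipping candidates so similar to the query that they may be unlabeled positives.

```python
def pick_hard_negative(similarities, positive_ids, max_score=None):
    """Return the index of the most similar non-positive chunk, or None."""
    best_idx, best_score = None, float("-inf")
    for idx, score in enumerate(similarities):
        if idx in positive_ids:
            continue  # never pick a known positive as a negative
        if max_score is not None and score > max_score:
            continue  # suspiciously similar: possibly an unlabeled positive
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx


# Toy similarity scores between one query and four chunks; chunk 0 is the positive.
sims = [0.91, 0.88, 0.95, 0.40]
hardest = pick_hard_negative(sims, positive_ids={0})
capped = pick_hard_negative(sims, positive_ids={0}, max_score=0.90)
```

Here `hardest` is chunk 2 and `capped` is chunk 1, since the cutoff rejects the 0.95 candidate. Whether a cutoff helps depends on how much near-duplicate content the corpus contains.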

Fine-Tuning Loop

```python
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
train_dataloader = DataLoader(pairs, shuffle=True, batch_size=32)

# Multiple Negatives Ranking Loss: uses in-batch negatives + the hard neg
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=int(len(train_dataloader) * 0.1),
    output_path="./bge-medical-v1",
    show_progress_bar=True,
)
```
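To make the loss less of a black box, here is a rough pure-Python sketch of what a multiple-negatives ranking objective computes for one batch: each query is scored against every positive and every hard negative in the batch, and cross-entropy pushes its own positive to the top. This is an illustration under assumptions, not the library's implementation; `mnr_loss` is a hypothetical name, and the scale of 20 over cosine similarity is assumed here to mirror common defaults.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(anchors, positives, hard_negatives, scale=20.0):
    """Mean cross-entropy where anchor i must rank positive i above all candidates."""
    candidates = positives + hard_negatives  # in-batch positives + hard negatives
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]  # -log softmax at the true positive
    return total / len(anchors)


# Toy 2-D embeddings: two queries, their positives, and one hard negative each.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
negatives = [[-1.0, 0.0], [0.0, -1.0]]

aligned_loss = mnr_loss(anchors, positives, negatives)
swapped_loss = mnr_loss(anchors, positives[::-1], negatives)
```

The aligned batch yields a loss near zero, while swapping the positives yields a large loss, which is the gradient signal fine-tuning relies on. It also shows why duplicate chunks inside one batch are harmful: they act as negatives for their own queries.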

Evaluation

Always evaluate against a held-out eval set with measurable metrics:

```python
from sentence_transformers.evaluation import InformationRetrievalEvaluator

evaluator = InformationRetrievalEvaluator(
    queries=eval_queries,          # {query_id: query_text}
    corpus=eval_corpus,            # {doc_id: doc_text}
    relevant_docs=eval_relevant,   # {query_id: set(relevant_doc_ids)}
    name="medical-eval",
    show_progress_bar=True,
)

result = evaluator(model, output_path="./eval-results")
```
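For intuition about the headline metric, Recall@k can be computed by hand from ranked results. The sketch below assumes the same `{query_id: set(relevant_doc_ids)}` shape used above; `recall_at_k`, `mean_recall_at_k`, and the toy IDs are illustrative names, not part of the library.

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of the relevant docs that appear in the top-k ranked results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def mean_recall_at_k(results, relevant_docs, k=5):
    """Average Recall@k over all queries that have relevance labels."""
    scores = [
        recall_at_k(results.get(qid, []), relevant, k)
        for qid, relevant in relevant_docs.items()
    ]
    return sum(scores) / len(scores)


# Toy retrieval output: ranked doc IDs per query, best match first.
results = {"q1": ["d3", "d1", "d9"], "q2": ["d7", "d8", "d2"]}
relevant_docs = {"q1": {"d1"}, "q2": {"d2", "d5"}}

recall_at_2 = mean_recall_at_k(results, relevant_docs, k=2)
recall_at_3 = mean_recall_at_k(results, relevant_docs, k=3)
```

With these toy rankings, `recall_at_2` is 0.5 and `recall_at_3` is 0.75. Running the same calculation for the base model and the fine-tuned model on one held-out set gives a like-for-like comparison.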

Typical Improvements

| Domain | Generic BGE Recall@5 | Fine-tuned Recall@5 |
| --- | --- | --- |
| Medical | 0.62 | 0.81 |
| Legal | 0.58 | 0.79 |
| Financial filings | 0.66 | 0.84 |
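When quoting gains like these, it helps to state whether they are absolute or relative. The short calculation below uses only the Recall@5 figures from the table; the `gains` helper is an illustrative name.

```python
# Recall@5 (generic BGE, fine-tuned) taken from the table above.
results = {
    "Medical": (0.62, 0.81),
    "Legal": (0.58, 0.79),
    "Financial filings": (0.66, 0.84),
}


def gains(before, after):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_points = (after - before) * 100
    relative_percent = (after - before) / before * 100
    return round(absolute_points, 1), round(relative_percent, 1)


summary = {domain: gains(*scores) for domain, scores in results.items()}
for domain, (points, percent) in summary.items():
    print(f"{domain}: +{points} points absolute, +{percent}% relative")
```

The absolute gains are 18 to 21 points, while the relative gains are roughly 27% to 36%, so the two framings differ noticeably for the same data.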

Practical Tips

  1. Quality > quantity: 5,000 well-mined pairs > 50,000 noisy ones
  2. Filter by length: Drop ultra-short or ultra-long chunks before training
  3. Domain LLM for query generation: Generic GPT can generate generic-sounding questions; use a domain-tuned LLM where possible
  4. Validate by humans: Sample 100 generated pairs and read them. If they look unnatural, regenerate
  5. Version your embeddings: Re-indexing 50M vectors is expensive; always know which model produced which collection
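Two of these tips are easy to automate. The sketch below shows one possible length filter and one possible naming scheme that ties a vector collection to the model that produced it. The word-count thresholds, the `filter_by_length` and `collection_name` helpers, and the naming format are all assumptions for illustration; tune them to your corpus and vector store.

```python
def filter_by_length(chunks, min_words=20, max_words=400):
    """Keep chunks whose whitespace-delimited word count is within bounds."""
    return [c for c in chunks if min_words <= len(c.split()) <= max_words]


def collection_name(base, model_name, model_version):
    """Build a collection name that records which embedding model produced it."""
    slug = model_name.replace("/", "-").lower()
    return f"{base}__{slug}__{model_version}"


# Toy chunks: one too short, one acceptable, one too long.
chunks = ["Too short.", "word " * 50, "word " * 1000]
kept = filter_by_length(chunks)

name = collection_name("medical_docs", "bge-medical", "v1")
```

Here only the 50-word chunk survives, and `name` is `medical_docs__bge-medical__v1`. Encoding the model version in the collection name makes it harder to query an index with the wrong encoder after a retrain.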
Dilip Singh
Lead Software Architect · Hureka Technologies

14+ years building enterprise software and AI systems. Architecting multi-agent AI platforms, RAG pipelines, voice AI, and high-performance SaaS for global clients.