RAG Systems · Advanced · 2025-03-10 · 18 min read

RAG Pipeline Design: Chunking, Embeddings & Qdrant at Production Scale

Everything I learned building production RAG systems. Optimal chunk sizes, embedding model selection, Qdrant HNSW tuning, hybrid search, and reranking strategies.

The Three Pillars of Production RAG

After building RAG systems for 10+ enterprise clients, I've learned that retrieval quality depends on three things: how you chunk documents, which embeddings you use, and how you search. Get any one wrong and your AI gives confidently wrong answers.

Chunking Strategy

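The most common mistake is fixed-size character chunking, which cuts text at arbitrary offsets and often splits sentences in half. Instead, use recursive splitting at natural boundaries (paragraphs first, then lines, sentences, and words):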

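```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are measured in characters by default.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    # Tried in order: paragraphs, lines, sentences, words, then characters.
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document)  # document: the raw text as a str
```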
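To make the idea concrete, here is a minimal pure-Python sketch of recursive splitting. It is not LangChain's implementation: the `recursive_split` name is hypothetical, overlap is left out, and separators that fall on a chunk boundary are dropped. It only shows the core rule, which is to try the coarsest separator first and fall back to finer ones when a piece is still too long.

```python
def recursive_split(text, chunk_size=512,
                    separators=("\n\n", "\n", ". ", " ", "")):
    """Split text into chunks of at most chunk_size characters,
    preferring the coarsest separator that keeps chunks small enough."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []

    # Last resort (or no separators left): hard cut at fixed offsets.
    if not separators or separators[0] == "":
        cuts = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        return [c for c in cuts if c.strip()]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep packing pieces together
            continue
        if current:
            chunks.append(current)       # flush what we have so far
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]
```

With a short two-paragraph string and `chunk_size=50`, this returns one chunk per paragraph, whereas slicing the same string every 50 characters would end the first chunk partway into the second paragraph.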

Embedding Model Selection

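| Model | Dimensions | Speed | Quality | Cost |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | Free |
| all-mpnet-base-v2 | 768 | Medium | Better | Free |
| text-embedding-3-small | 1536 | API | Best | $0.02/1M tokens |
| nomic-embed-text | 768 | Medium | Very Good | Free |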

For most enterprise RAG, all-mpnet-base-v2 gives the best quality/cost tradeoff when running self-hosted.
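Dimensions also drive storage. The sketch below estimates the raw size of the vectors alone for each model in the table, assuming float32 storage (4 bytes per dimension) and a hypothetical corpus of one million chunks. Index structures and payloads add to this, so treat the numbers as a lower bound.

```python
BYTES_PER_FLOAT32 = 4

# Dimensions taken from the table above.
DIMENSIONS = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
    "text-embedding-3-small": 1536,
    "nomic-embed-text": 768,
}


def raw_vector_bytes(num_vectors: int, dimensions: int) -> int:
    """Bytes needed for the vectors alone, stored as float32."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


if __name__ == "__main__":
    num_chunks = 1_000_000  # hypothetical corpus size
    for model, dims in DIMENSIONS.items():
        gib = raw_vector_bytes(num_chunks, dims) / 1024**3
        print(f"{model}: {gib:.2f} GiB")
```

Doubling the dimensions doubles the footprint, which is one reason a 768-dimension model can be a comfortable middle ground for self-hosted deployments.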

Qdrant Configuration

```python
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, HnswConfigDiff

client = QdrantClient("localhost", port=6333)

client.create_collection(
    collection_name="enterprise_docs",
    # size must match the embedding model (768 for all-mpnet-base-v2)
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(
        m=16,              # graph links per node
        ef_construct=100,  # neighbours considered while building the index
    ),
)
```
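HNSW is approximate, so any change to `m` or `ef_construct` is worth checking against exact results on a sample of queries. The sketch below is one way to do that in plain Python: a brute-force cosine top-k as ground truth, plus a recall@k helper. The function names are hypothetical, and in practice the approximate IDs would come from your Qdrant search results rather than a hand-written list.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_top_k(query, vectors, k):
    """Brute-force ground truth. vectors maps id -> vector."""
    ranked = sorted(vectors, key=lambda vid: cosine(query, vectors[vid]),
                    reverse=True)
    return ranked[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k that the approximate search returned."""
    exact = set(exact_ids)
    if not exact:
        return 1.0
    return len(exact & set(approx_ids)) / len(exact)
```

If recall on the sample is noticeably below 1.0, raising `ef_construct` or `m` trades index build time and memory for accuracy.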

Hybrid Search with Reranking

Pure vector search can miss exact keyword matches. Combine dense and sparse retrieval, then rerank:

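```python
from sentence_transformers import CrossEncoder

# Dense-only first pass; the sparse (keyword) leg is not shown here.
# Assumes `client` from the Qdrant setup, plus `query` (str) and
# `dense_vector` (the query embedding) defined elsewhere.
results = client.search(
    collection_name="enterprise_docs",
    query_vector=dense_vector,
    limit=20,
    with_payload=True,
)

# Rerank the 20 candidates with a cross-encoder and keep the top 5.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, r.payload["text"]) for r in results])
ranked = sorted(zip(results, scores), key=lambda x: x[1], reverse=True)
top_5 = [r for r, s in ranked[:5]]
```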
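The post does not say how the dense and sparse result lists are merged. One common option is reciprocal rank fusion (RRF), sketched here in plain Python as an illustration rather than as the method used in these systems. Each input is a list of document IDs ordered best-first, and `k=60` is a commonly used default for the smoothing constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ID lists into one ranking.

    Each document scores 1 / (k + rank) per list it appears in,
    so items ranked well by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

The fused list would then replace the dense-only candidates as input to the cross-encoder, so the reranker sees both semantic and keyword matches.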

Common Mistakes

1. Too-small chunks: Chunks under 100 tokens lose context. 400–600 tokens is the sweet spot.
2. No metadata filtering: Always store document_id, tenant_id, date, and section in the payload.
3. Skipping reranking: First-pass retrieval (k=20) followed by cross-encoder reranking (top 5) consistently outperforms direct retrieval with k=5.
4. Not monitoring retrieval quality: Use Langfuse to track which retrieved chunks actually appeared in LLM responses.
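For the last point, a rough way to decide whether a chunk "appeared" in a response is word n-gram overlap. The sketch below is a heuristic under stated assumptions, not a Langfuse API: the function names, the 4-gram window, and the one-shared-n-gram threshold are all hypothetical choices. The resulting rate is the kind of number you could log as a score alongside each trace.

```python
import re


def _tokens(text):
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def _ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def chunk_used(chunk_text, answer_text, n=4, min_shared=1):
    """True if the chunk and the answer share at least min_shared n-grams."""
    shared = _ngrams(_tokens(chunk_text), n) & _ngrams(_tokens(answer_text), n)
    return len(shared) >= min_shared


def utilization(chunks, answer_text, n=4):
    """chunks maps chunk_id -> text. Returns (rate, used_ids)."""
    used = [cid for cid, text in chunks.items()
            if chunk_used(text, answer_text, n)]
    rate = len(used) / len(chunks) if chunks else 0.0
    return rate, used
```

A persistently low rate suggests that retrieval is returning chunks the model ignores. Paraphrased answers will be undercounted, so it works better as a trend signal than as an exact measure.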
Dilip Singh
Lead Software Architect · Hureka Technologies

14+ years building enterprise software and AI systems. Architecting multi-agent AI platforms, RAG pipelines, voice AI, and high-performance SaaS for global clients.