Series: AI Systems at Scale · Part 2 of 5
RAG Pipeline Design: Chunking, Embeddings & Qdrant at Production Scale
Everything I learned building production RAG systems. Optimal chunk sizes, embedding model selection, Qdrant HNSW tuning, hybrid search, and reranking strategies.
The Three Pillars of Production RAG
After building RAG systems for 10+ enterprise clients, I've learned that retrieval quality depends on three things: how you chunk documents, which embeddings you use, and how you search. Get any one wrong and your AI gives confidently wrong answers.
Chunking Strategy
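The most common mistake is fixed-size character chunking, which cuts text mid-sentence. Instead, use recursive splitting at natural boundaries:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Try paragraph breaks first, then line breaks, sentences, words, and finally characters.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # measured in characters by default, not tokens
    chunk_overlap=64,  # characters shared between neighbouring chunks
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = splitter.split_text(document)
```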
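To make the idea of "recursive splitting at natural boundaries" concrete, here is a minimal pure-Python sketch of the general technique. It is not LangChain's implementation: the function name `recursive_split` and the sample text are hypothetical, sizes are counted in characters, and chunk overlap is left out for brevity.

```python
def recursive_split(text, chunk_size=512, separators=("\n\n", "\n", ". ", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    trying the coarsest separator first and finer ones only when needed."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: no natural boundary left, so cut at fixed offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        # Greedily merge neighbouring parts while they still fit in one chunk.
        candidate = current + sep + part if current else part
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: retry it with the finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Para one sentence. Another sentence.\n\nPara two is here."
chunks = recursive_split(sample, chunk_size=40)
```

With a 40-character limit the sample splits at the paragraph break, so neither paragraph is cut mid-sentence. Only when a paragraph exceeds the limit does the splitter fall back to sentence, word, and finally character boundaries.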
Embedding Model Selection
| Model | Dimensions | Speed | Quality | Cost |
|---|---|---|---|---|
| all-MiniLM-L6-v2 | 384 | Fast | Good | Free |
| all-mpnet-base-v2 | 768 | Medium | Better | Free |
| text-embedding-3-small | 1536 | API | Best | $0.02/1M tokens |
| nomic-embed-text | 768 | Medium | Very Good | Free |
For most enterprise RAG, all-mpnet-base-v2 gives the best quality/cost tradeoff when running self-hosted.
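To put the API price in the table into perspective, a rough back-of-the-envelope calculation helps. The sketch below assumes the $0.02 per 1M tokens figure from the table. The corpus size and tokens-per-chunk values are hypothetical, and the function name is illustrative only.

```python
PRICE_PER_MILLION_TOKENS = 0.02  # USD, text-embedding-3-small (figure from the table)


def embedding_api_cost(num_chunks, tokens_per_chunk,
                       price_per_million=PRICE_PER_MILLION_TOKENS):
    """Estimate the one-off cost of embedding a corpus through a paid API."""
    total_tokens = num_chunks * tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million


# Hypothetical corpus: 2 million chunks of roughly 500 tokens each.
cost = embedding_api_cost(num_chunks=2_000_000, tokens_per_chunk=500)
print(f"Estimated embedding cost: ${cost:.2f}")
```

The same function can be rerun whenever the corpus is re-chunked, since every re-chunking pass means re-embedding and paying again. A self-hosted model trades that fee for compute time.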
Qdrant Configuration
```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, VectorParams

client = QdrantClient("localhost", port=6333)

client.create_collection(
    collection_name="enterprise_docs",
    # size must match the embedding model's output (768 for all-mpnet-base-v2)
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(
        m=16,              # links per node in the HNSW graph
        ef_construct=100,  # candidate list size while building the index
    ),
)
```
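Before settling on a vector size, it is worth estimating how much RAM the collection will need. The sketch below uses a common rule of thumb: vectors × dimensions × 4 bytes for float32, times 1.5 for index and bookkeeping overhead. The 1.5 factor is an assumption taken from general capacity-planning guidance, not a measured value. Real usage varies with `m`, payload size, and quantization, and the function name is hypothetical.

```python
def estimate_vector_memory_bytes(num_vectors, dim, bytes_per_value=4, overhead=1.5):
    """Rough RAM estimate for float32 vectors plus index overhead (rule of thumb)."""
    return int(num_vectors * dim * bytes_per_value * overhead)


# Hypothetical collection: 1 million chunks embedded at 768 dimensions.
mem = estimate_vector_memory_bytes(num_vectors=1_000_000, dim=768)
print(f"Approximate memory: {mem / 1024**3:.1f} GiB")
```

Switching to a 1536-dimension model doubles this estimate, which is one reason the dimension column in the table above matters as much as the quality column.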
Hybrid Search with Reranking
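Pure vector search can miss exact keyword matches. Combine dense and sparse retrieval, then rerank. The snippet below shows the dense retrieval and reranking steps:
```python
from sentence_transformers import CrossEncoder

# First pass: retrieve a wide candidate set (k=20) by dense vector similarity.
results = client.search(
    collection_name="enterprise_docs",
    query_vector=dense_vector,
    limit=20,
    with_payload=True,
)

# Second pass: score each (query, chunk) pair with a cross-encoder and keep the top 5.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, r.payload["text"]) for r in results])
ranked = sorted(zip(results, scores), key=lambda x: x[1], reverse=True)
top_5 = [r for r, _ in ranked[:5]]
```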
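The "combine dense + sparse" step needs some way to merge two ranked lists whose scores are not directly comparable. One widely used option is reciprocal rank fusion (RRF), which the article does not name; it is shown here as one possible approach, not as the method behind these systems. The document IDs are hypothetical, and `k=60` is the conventional constant rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked ID lists; each hit contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a dense (vector) and a sparse (keyword) retriever.
dense_hits = ["doc_a", "doc_b", "doc_c"]
sparse_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that both retrievers rank highly rise to the top, while a keyword-only hit such as `doc_d` still makes the candidate list. The fused list can then be passed to the cross-encoder for reranking.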
Common Mistakes
1. Too-small chunks: Chunks under 100 tokens lose context. 400–600 tokens is the sweet spot.
2. No metadata filtering: Always store document_id, tenant_id, date, and section in the payload.
3. Skipping reranking: First-pass retrieval (k=20) followed by cross-encoder reranking (top 5) consistently outperforms direct k=5 retrieval.
4. Not monitoring retrieval quality: Use Langfuse to track which retrieved chunks actually appeared in LLM responses.
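For the monitoring point, one simple metric is the share of retrieved chunks that the final answer actually drew on. The sketch below is tool-agnostic and makes no Langfuse API calls. It assumes you already know which chunk IDs the response used, for example because the prompt asks the model to cite chunk IDs. The function name and the IDs are hypothetical.

```python
def chunk_usage_rate(retrieved_ids, cited_ids):
    """Fraction of retrieved chunks that the LLM response actually used."""
    if not retrieved_ids:
        return 0.0
    cited = set(cited_ids)
    used = [chunk_id for chunk_id in retrieved_ids if chunk_id in cited]
    return len(used) / len(retrieved_ids)


# Hypothetical request: five chunks retrieved, two cited in the answer.
rate = chunk_usage_rate(["c1", "c2", "c3", "c4", "c5"], ["c2", "c5"])
print(f"Chunk usage rate: {rate:.0%}")
```

A rate that stays low over time suggests the retriever is returning material the model ignores, which usually points back to chunking or reranking.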