Enterprise RAG Pipeline Architecture: From POC to Production
Complete guide to building production RAG systems — chunking strategies, embedding models, hybrid search with Qdrant, reranking, evaluation metrics, and deployment patterns for enterprise scale.
The RAG Gap: Why Your POC Fails at Scale
Every RAG demo looks impressive. You chunk some documents, embed them, retrieve the top-5 results, and the LLM gives a coherent answer. Then you hand it to real users with real documents and everything falls apart: wrong answers with high confidence, missing context from long documents, and retrieval that returns irrelevant noise.
I have built RAG pipelines for enterprise clients processing millions of documents — medical records, legal contracts, financial reports. The gap between a POC and a production RAG system is enormous, and it is almost entirely about the engineering between the embedding and the LLM call.
This guide covers every layer of a production RAG pipeline, from document ingestion to answer generation and evaluation.
Document Processing: The Foundation Nobody Talks About
Before you worry about embeddings or vector databases, you need clean, structured input. Garbage in, garbage out applies doubly to RAG.
Parsing Pipeline
```python
from datetime import datetime, timezone

from unstructured.partition.auto import partition


def process_document(file_path: str, tenant_id: str) -> list:
    """Parse any document format into structured elements."""
    # Keyword names follow the original example; check them against the
    # unstructured version you have installed.
    elements = partition(
        filename=file_path,
        strategy="hi_res",
        pdf_infer_table_structure=True,
        extract_images_in_pdf=True,
    )

    metadata = {
        "tenant_id": tenant_id,
        "source_file": file_path,
        "processed_at": datetime.now(timezone.utc).isoformat(),
        "doc_type": detect_doc_type(file_path),  # helper defined elsewhere in the pipeline
    }

    # Element metadata is an object, not a dict, so set each field as an attribute.
    for element in elements:
        for key, value in metadata.items():
            setattr(element.metadata, key, value)

    return elements
```
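The parsing function relies on a `detect_doc_type` helper that is not shown. A minimal sketch of what it could look like, assuming that a coarse type derived from the file extension is enough; the mapping and the type labels are hypothetical and should be adapted to the formats your pipeline accepts:

```python
from pathlib import Path

# Hypothetical extension-to-type mapping; extend it for your own corpus.
_DOC_TYPES = {
    ".pdf": "pdf",
    ".docx": "word",
    ".doc": "word",
    ".pptx": "slides",
    ".xlsx": "spreadsheet",
    ".csv": "spreadsheet",
    ".html": "html",
    ".htm": "html",
    ".md": "text",
    ".txt": "text",
}


def detect_doc_type(file_path: str) -> str:
    """Return a coarse document type based on the file extension."""
    suffix = Path(file_path).suffix.lower()
    return _DOC_TYPES.get(suffix, "unknown")
```

Extension sniffing is cheap but easy to fool; if uploads are untrusted, content-based detection is the safer choice.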
Chunking Strategies Compared
Chunking is the single most impactful decision in your RAG pipeline. Here is what actually works:
| Strategy | Chunk Size | Overlap | Best For | Weakness |
|---|---|---|---|---|
| Fixed-size | 512 tokens | 50 tokens | Simple docs, fast prototyping | Breaks mid-sentence, ignores structure |
| Recursive character | 1000 chars | 200 chars | General purpose | No semantic awareness |
| Semantic (by embedding similarity) | Variable | Natural | Technical docs, mixed content | Slower, requires embedding model |
| Document-structure aware | Variable | By section | PDFs, reports, legal docs | Needs good parsing |
| Parent-child (hierarchical) | Parent: 2000 chars, Child: 400 chars | 50 chars (children only) | Long documents, complex queries | More storage, complex retrieval |
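To make the size and overlap columns concrete, here is a minimal sketch of fixed-size chunking with overlap. It counts characters rather than tokens to stay dependency-free, and the name `split_text` and its signature are assumptions chosen for illustration:

```python
def split_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


print(split_text("abcdefghij", chunk_size=4, overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one, which is exactly the weakness listed in the table: the cut points ignore sentence and section boundaries.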
For enterprise systems, I recommend parent-child chunking combined with document-structure awareness:
```python
from uuid import uuid4


def hierarchical_chunk(elements: list, parent_size: int = 2000, child_size: int = 400):
    """Create parent-child chunk hierarchy for better retrieval."""
    parents = []
    children = []

    # group_by_section and split_text are helpers defined elsewhere in the pipeline.
    sections = group_by_section(elements)
    for section in sections:
        parent_text = "\n".join([e.text for e in section])
        parent_id = str(uuid4())

        parents.append({
            "id": parent_id,
            "text": parent_text[:parent_size],  # truncated to parent_size characters
            "metadata": {
                "type": "parent",
                # Element metadata is an object, so read the title as an attribute.
                "section_title": getattr(section[0].metadata, "title", None) or "",
            },
        })

        child_texts = split_text(parent_text, child_size, overlap=50)
        for i, child_text in enumerate(child_texts):
            children.append({
                "id": str(uuid4()),
                "text": child_text,
                "parent_id": parent_id,
                "metadata": {"type": "child", "chunk_index": i},
            })

    return parents, children
```
The key insight: retrieve on children, return parents. Small chunks give precise retrieval; parent chunks give the LLM enough context.
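One way to implement "retrieve on children, return parents" is sketched below. It assumes each child hit is a dict carrying `parent_id` and a similarity `score`, and that parents are available in a lookup keyed by id; the function name and the `score` field are illustrative rather than part of any library:

```python
def children_to_parents(
    child_hits: list[dict],
    parents_by_id: dict[str, dict],
    max_parents: int = 3,
) -> list[dict]:
    """Collapse child hits into unique parents, ranked by their best child score."""
    best_score: dict[str, float] = {}
    for hit in child_hits:
        parent_id = hit["parent_id"]
        if parent_id not in best_score or hit["score"] > best_score[parent_id]:
            best_score[parent_id] = hit["score"]

    ranked_ids = sorted(best_score, key=best_score.get, reverse=True)
    # Skip ids that are missing from the lookup instead of failing the query.
    return [parents_by_id[pid] for pid in ranked_ids[:max_parents] if pid in parents_by_id]


hits = [
    {"parent_id": "p1", "score": 0.7},
    {"parent_id": "p2", "score": 0.9},
    {"parent_id": "p1", "score": 0.8},
]
parents = {"p1": {"id": "p1", "text": "Section A"}, "p2": {"id": "p2", "text": "Section B"}}
print([p["id"] for p in children_to_parents(hits, parents)])
# ['p2', 'p1']
```

Deduplicating by parent matters because several children of the same section often match a query, and sending the same parent text to the LLM twice wastes context.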
Embedding Models: Choosing the Right One
The embedding model determines your retrieval quality ceiling. No amount of reranking can fix bad embeddings.
| Model | Dimensions | Max Tokens | MTEB Score | Latency | Cost |
|---|---|---|---|---|---|
| text-embedding-3-large | 3072 | 8191 | 64.6 | ~50ms | $0.13/1M |
| text-embedding-3-small | 1536 | 8191 | 62.3 | ~30ms | $0.02/1M |
| voyage-3-large | 1024 | 32000 | 67.2 | ~80ms | $0.18/1M |
| BGE-M3 (self-hosted) | 1024 | 8192 | 66.1 | ~20ms | GPU cost only |
| Cohere embed-v3 | 1024 | 512 | 66.5 | ~40ms | $0.10/1M |
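The Cost column turns into a one-off indexing bill with simple arithmetic. A minimal sketch, where the corpus size (1,000,000 chunks) and the average chunk length (400 tokens) are made-up illustrative numbers and only the per-million prices come from the table:

```python
def embedding_cost_usd(
    num_chunks: int,
    avg_tokens_per_chunk: int,
    price_per_million_tokens: float,
) -> float:
    """Estimate the cost of embedding a corpus once."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical corpus: 1,000,000 chunks of about 400 tokens each (400M tokens).
small = embedding_cost_usd(1_000_000, 400, 0.02)
large = embedding_cost_usd(1_000_000, 400, 0.13)
print(f"text-embedding-3-small: ${small:.2f}")  # $8.00
print(f"text-embedding-3-large: ${large:.2f}")  # $52.00
```

Re-embedding after a model change costs the same again, so the estimate is worth repeating for every model you shortlist.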
For enterprise deployments where you need data privacy, self-hosting BGE-M3 gives you excellent quality with no data leaving your infrastructure:
```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)


def embed_texts(texts: list[str]) -> list[list[float]]:
    embeddings = model.encode(
        texts,
        batch_size=32,
        max_length=8192,
        return_dense=True,
        return_sparse=True,
    )
    # dense_vecs is a NumPy array; convert it to match the declared return type.
    return embeddings["dense_vecs"].tolist()
```
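The function above requests sparse output but only returns the dense vectors. If you also want the sparse side, the weights need to be reshaped into parallel index and value lists. The sketch below assumes the sparse output for one text is a mapping from token id (possibly as a string) to weight; verify that against the FlagEmbedding version you run. The helper name is hypothetical:

```python
def lexical_weights_to_sparse(weights: dict) -> tuple[list[int], list[float]]:
    """Convert a {token_id: weight} mapping into sorted (indices, values) lists."""
    items = sorted((int(token_id), float(weight)) for token_id, weight in weights.items())
    indices = [index for index, _ in items]
    values = [value for _, value in items]
    return indices, values


indices, values = lexical_weights_to_sparse({"42": 0.5, "7": 0.25})
print(indices, values)
# [7, 42] [0.25, 0.5]
```

The two lists map directly onto the `indices` and `values` fields of a Qdrant sparse vector.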
Hybrid Search with Qdrant
Pure vector search misses keyword-specific queries. Pure keyword search misses semantic meaning. Hybrid search combines both:
```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="enterprise_docs",
    vectors_config={
        "dense": models.VectorParams(size=1024, distance=models.Distance.COSINE),
    },
    sparse_vectors_config={
        "sparse": models.SparseVectorParams(
            modifier=models.Modifier.IDF,
        )
    },
)


def hybrid_search(query: str, collection: str = "enterprise_docs", top_k: int = 20):
    """Combine dense and sparse search with Reciprocal Rank Fusion."""
    # embed_dense and embed_sparse are the pipeline's query embedding helpers.
    dense_vector = embed_dense(query)
    sparse_vector = embed_sparse(query)

    results = client.query_points(
        collection_name=collection,
        prefetch=[
            models.Prefetch(query=dense_vector, using="dense", limit=top_k),
            models.Prefetch(query=sparse_vector, using="sparse", limit=top_k),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=top_k,
    )
    return results.points
```
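Reciprocal Rank Fusion itself is simple enough to write by hand, which helps when debugging why a document ranked where it did. This is a plain-Python sketch of the textbook formula, not Qdrant's internal implementation; the constant `k = 60` is the commonly used default and is an assumption here:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_ranking = ["a", "b", "c"]
sparse_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print([doc_id for doc_id, _ in fused])
# ['b', 'a', 'd', 'c']
```

Because only ranks are used, the dense and sparse scores never need to be on the same scale, which is the main reason RRF is a safe default for hybrid search.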
Qdrant Configuration for Production
Critical settings that affect performance and reliability:
```yaml
# qdrant-config.yaml
storage:
  storage_path: /data/qdrant
  optimizers:
    default_segment_number: 4
    indexing_threshold: 20000
    memmap_threshold: 50000
  wal:
    wal_capacity_mb: 256
  performance:
    max_search_threads: 0 # auto-detect
  hnsw_index:
    m: 16
    ef_construct: 128
    full_scan_threshold: 10000
```
Reranking: The Secret Weapon
Reranking takes your top-N retrieved results and re-scores them with a cross-encoder model that considers query-document interaction. This consistently improves retrieval quality by 15-25%.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", max_length=512)


def rerank_results(query: str, results: list[dict], top_k: int = 5) -> list[dict]:
    """Rerank retrieved results using a cross-encoder."""
    if not results:
        return []

    pairs = [(query, r["text"]) for r in results]
    scores = reranker.predict(pairs)

    for result, score in zip(results, scores):
        result["rerank_score"] = float(score)

    ranked = sorted(results, key=lambda x: x["rerank_score"], reverse=True)
    return ranked[:top_k]
```
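The reranker expects a list of dicts with a `text` key, while the hybrid search returns Qdrant point objects. A small adapter closes that gap. The sketch below assumes the chunk text was stored in each point's payload under `text`, which the collection setup above does not show; the function name and the `retrieval_score` key are illustrative:

```python
from types import SimpleNamespace


def points_to_results(points) -> list[dict]:
    """Convert scored points (with .id, .score, .payload) into reranker input dicts."""
    results = []
    for point in points:
        payload = point.payload or {}
        text = payload.get("text")
        if not text:
            continue  # nothing to rerank if the chunk text was not stored
        results.append({
            "id": point.id,
            "text": text,
            "retrieval_score": point.score,
            "metadata": {key: value for key, value in payload.items() if key != "text"},
        })
    return results


# Stand-in objects that mimic the attributes of a scored point.
demo_points = [
    SimpleNamespace(id="c1", score=0.82, payload={"text": "Clause 4.2 ...", "tenant_id": "t1"}),
    SimpleNamespace(id="c2", score=0.79, payload=None),
]
print(points_to_results(demo_points))
```

With the adapter in place, the flow is: retrieve 20 candidates, convert them, rerank, and pass the top 5 to the LLM.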
RAG Evaluation: Measuring What Matters
You cannot improve what you do not measure. These are the metrics that matter for production RAG:
| Metric | What It Measures | Target | How to Compute |
|---|---|---|---|
| Faithfulness | Answer grounded in retrieved context | > 0.9 | LLM-as-judge: "Is this answer supported by the context?" |
| Answer Relevance | Answer addresses the question | > 0.85 | LLM-as-judge: "Does this answer the question?" |
| Context Precision | Retrieved docs are relevant | > 0.8 | Fraction of retrieved chunks that are relevant |
| Context Recall | All relevant docs retrieved | > 0.75 | Fraction of ground-truth docs that appear in retrieval |
| Hallucination Rate | Claims not in context | < 5% | LLM-as-judge with citation checking |
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

# questions, generated_answers, retrieved_contexts and reference_answers are
# parallel lists prepared from your evaluation set.
eval_dataset = Dataset.from_dict({
    "question": questions,
    "answer": generated_answers,
    "contexts": retrieved_contexts,
    "ground_truth": reference_answers,
})

results = evaluate(
    eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

print(f"Faithfulness: {results['faithfulness']:.3f}")
print(f"Answer Relevancy: {results['answer_relevancy']:.3f}")
print(f"Context Precision: {results['context_precision']:.3f}")
print(f"Context Recall: {results['context_recall']:.3f}")
```
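When you have labelled relevant chunk ids for a question, the two retrieval metrics can also be computed directly from the fraction definitions in the table, with no LLM judge involved. A minimal sketch; RAGAS computes its own variants differently, so these numbers are not directly comparable with its output:

```python
def context_precision_simple(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)


def context_recall_simple(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of relevant chunks that were retrieved."""
    if not relevant_ids:
        return 0.0  # recall is undefined without ground truth; reported as 0.0 here
    found = relevant_ids.intersection(retrieved_ids)
    return len(found) / len(relevant_ids)


retrieved = ["c1", "c2", "c3", "c4"]
relevant = {"c2", "c4", "c9"}
print(context_precision_simple(retrieved, relevant))  # 0.5
print(round(context_recall_simple(retrieved, relevant), 3))  # 0.667
```

These cheap checks are useful in CI, where running an LLM judge on every commit is often too slow or too expensive.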
Production Deployment Patterns
Multi-Tenant Isolation
All tenants share a single Qdrant collection, and every query is restricted to one tenant with a payload filter on tenant_id:
```python
def search_tenant_docs(tenant_id: str, query: str, top_k: int = 10):
    """Search within a tenant's documents only."""
    results = client.query_points(
        collection_name="enterprise_docs",
        query=embed_dense(query),
        using="dense",  # the collection defines a named dense vector
        query_filter=models.Filter(
            must=[models.FieldCondition(key="tenant_id", match=models.MatchValue(value=tenant_id))]
        ),
        limit=top_k,
    )
    return results.points
```
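Because a missing or wrong filter leaks data across tenants, it can be worth adding a second check on the way out. The sketch below is a defense-in-depth guard, assuming each returned point exposes a `payload` dict containing `tenant_id`; the exception class and function name are hypothetical:

```python
from types import SimpleNamespace


class TenantIsolationError(RuntimeError):
    """Raised when a search result belongs to a different tenant."""


def enforce_tenant(points, tenant_id: str) -> list:
    """Return the points unchanged, or raise if any belongs to another tenant."""
    if not tenant_id:
        raise ValueError("tenant_id is required")
    checked = []
    for point in points:
        payload = getattr(point, "payload", None) or {}
        if payload.get("tenant_id") != tenant_id:
            raise TenantIsolationError(f"point {point.id} is not owned by tenant {tenant_id}")
        checked.append(point)
    return checked


# Stand-in objects that mimic the attributes of a scored point.
own = SimpleNamespace(id="c1", payload={"tenant_id": "t1"})
foreign = SimpleNamespace(id="c2", payload={"tenant_id": "t2"})
print(len(enforce_tenant([own], "t1")))  # 1
```

Raising instead of silently dropping foreign points makes a broken filter visible in monitoring rather than hiding it.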
Caching for Cost Reduction
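Semantic caching can reduce both latency and LLM costs by 40-60% on workloads with many repeated or near-duplicate questions; the saving scales with the cache hit rate: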
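```python
from datetime import datetime, timezone
from uuid import uuid4

from qdrant_client import AsyncQdrantClient, models

# Awaited calls need the async client, not the synchronous QdrantClient.
async_client = AsyncQdrantClient(url="http://localhost:6333")


async def cached_rag_query(query: str, tenant_id: str):
    """Check semantic cache before running full RAG pipeline."""
    query_vector = embed(query)  # embed once, reuse for lookup and insert

    cache_results = await async_client.query_points(
        collection_name="rag_cache",
        query=query_vector,
        query_filter=models.Filter(
            must=[models.FieldCondition(key="tenant_id", match=models.MatchValue(value=tenant_id))]
        ),
        score_threshold=0.95,
        limit=1,
    )
    if cache_results.points:
        return cache_results.points[0].payload["answer"]

    # Cache miss: run the full pipeline and store the answer for next time.
    answer = await full_rag_pipeline(query, tenant_id)
    await async_client.upsert(
        collection_name="rag_cache",
        points=[
            models.PointStruct(
                id=str(uuid4()),
                vector=query_vector,
                payload={
                    "query": query,
                    "answer": answer,
                    "tenant_id": tenant_id,
                    "cached_at": datetime.now(timezone.utc).isoformat(),
                },
            )
        ],
    )
    return answer
```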
Conclusion
Building enterprise RAG is not about picking the right embedding model or vector database — it is about getting every layer right: parsing, chunking, embedding, retrieval, reranking, generation, and evaluation.
The patterns in this guide come from building RAG systems that process millions of documents for real enterprise clients. Start with the parent-child chunking strategy, add hybrid search, layer in reranking, and measure everything with RAGAS.