RAG Systems · Advanced · 2026-06-22 · 20 min read

Enterprise RAG Pipeline Architecture: From POC to Production

Complete guide to building production RAG systems — chunking strategies, embedding models, hybrid search with Qdrant, reranking, evaluation metrics, and deployment patterns for enterprise scale.

The RAG Gap: Why Your POC Fails at Scale

Every RAG demo looks impressive. You chunk some documents, embed them, retrieve the top-5 results, and the LLM gives a coherent answer. Then you hand it to real users with real documents and everything falls apart: wrong answers with high confidence, missing context from long documents, and retrieval that returns irrelevant noise.

I have built RAG pipelines for enterprise clients processing millions of documents — medical records, legal contracts, financial reports. The gap between a POC and a production RAG system is enormous, and it is almost entirely about the engineering between the embedding and the LLM call.

This guide covers every layer of a production RAG pipeline, from document ingestion to answer generation and evaluation.

Document Processing: The Foundation Nobody Talks About

Before you worry about embeddings or vector databases, you need clean, structured input. Garbage in, garbage out applies doubly to RAG.

Parsing Pipeline

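```python
from datetime import datetime, timezone

from unstructured.documents.elements import Element
from unstructured.partition.auto import partition


def process_document(file_path: str, tenant_id: str) -> list[Element]:
    """Parse any document format into structured elements."""
    elements = partition(
        filename=file_path,
        strategy="hi_res",  # layout-aware parsing: slower, but keeps structure
        pdf_infer_table_structure=True,
        extract_images_in_pdf=True,
    )

    # Metadata shared by every element of this document.
    metadata = {
        "tenant_id": tenant_id,
        "source_file": file_path,
        "processed_at": datetime.now(timezone.utc).isoformat(),
        "doc_type": detect_doc_type(file_path),  # project helper, not shown here
    }

    # ElementMetadata is an object, not a dict, so set each field as an attribute.
    for element in elements:
        for key, value in metadata.items():
            setattr(element.metadata, key, value)

    return elements
```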
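The `detect_doc_type` helper called in the parser is not defined in this guide. A minimal sketch, assuming the document type can be inferred from the file extension alone; the mapping and the `"unknown"` fallback are illustrative assumptions and not part of `unstructured`:

```python
from pathlib import Path

# Hypothetical mapping; extend it to match the formats you actually ingest.
_DOC_TYPES = {
    ".pdf": "pdf",
    ".docx": "word",
    ".doc": "word",
    ".xlsx": "spreadsheet",
    ".csv": "spreadsheet",
    ".html": "html",
    ".txt": "text",
    ".md": "text",
}


def detect_doc_type(file_path: str) -> str:
    """Guess a coarse document type from the file extension (case-insensitive)."""
    suffix = Path(file_path).suffix.lower()
    return _DOC_TYPES.get(suffix, "unknown")
```

Extension-based detection is cheap but easy to fool, so a production system may prefer to inspect the file contents instead.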

Chunking Strategies Compared

Chunking is the single most impactful decision in your RAG pipeline. Here is what actually works:

| Strategy | Chunk Size | Overlap | Best For | Weakness |
|---|---|---|---|---|
| Fixed-size | 512 tokens | 50 tokens | Simple docs, fast prototyping | Breaks mid-sentence, ignores structure |
| Recursive character | 1000 chars | 200 chars | General purpose | No semantic awareness |
| Semantic (by embedding similarity) | Variable | Natural | Technical docs, mixed content | Slower, requires embedding model |
| Document-structure aware | Variable | By section | PDFs, reports, legal docs | Needs good parsing |
| Parent-child (hierarchical) | Parent: 2000, Child: 400 | None | Long documents, complex queries | More storage, complex retrieval |
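
To make the overlap column concrete, here is a minimal sketch of fixed-size chunking with overlap. It counts whitespace-separated words as a stand-in for tokens, which is an assumption for illustration; a real pipeline would count tokens with the embedding model's tokenizer. The function name `fixed_size_chunks` is hypothetical.

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

Each chunk repeats the last `overlap` words of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.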

For enterprise systems, I recommend parent-child chunking combined with document-structure awareness:

```python
from uuid import uuid4


def hierarchical_chunk(elements: list, parent_size: int = 2000, child_size: int = 400):
    """Create parent-child chunk hierarchy for better retrieval."""
    parents = []
    children = []

    # group_by_section and split_text are project helpers that are not shown here.
    sections = group_by_section(elements)
    for section in sections:
        parent_text = "\n".join(e.text for e in section)
        parent_id = str(uuid4())

        parents.append({
            "id": parent_id,
            "text": parent_text[:parent_size],  # cap the context handed to the LLM
            "metadata": {
                "type": "parent",
                # ElementMetadata is an object, so read the title with getattr.
                "section_title": getattr(section[0].metadata, "title", "") or "",
            },
        })

        # Children are split from the full section text, not the capped parent.
        child_texts = split_text(parent_text, child_size, overlap=50)
        for i, child_text in enumerate(child_texts):
            children.append({
                "id": str(uuid4()),
                "text": child_text,
                "parent_id": parent_id,
                "metadata": {"type": "child", "chunk_index": i},
            })

    return parents, children
```

The key insight: retrieve on children, return parents. Small chunks give precise retrieval; parent chunks give the LLM enough context.
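
The "retrieve on children, return parents" step can be sketched as a small lookup. This assumes child hits arrive sorted best-first and carry the `parent_id` field produced by `hierarchical_chunk`, and that parents are available in a dictionary keyed by ID; the function name and the dictionary are hypothetical.

```python
def expand_to_parents(child_hits: list[dict], parents_by_id: dict[str, dict]) -> list[dict]:
    """Map ranked child hits to their parents, keeping rank order and dropping duplicates."""
    seen = set()
    expanded = []
    for hit in child_hits:  # assumed to be sorted best-first
        parent_id = hit["parent_id"]
        if parent_id in seen or parent_id not in parents_by_id:
            continue  # skip repeats and orphaned children
        seen.add(parent_id)
        expanded.append(parents_by_id[parent_id])
    return expanded
```

Deduplication matters because several children of the same section often match a single query, and sending the same parent to the LLM twice only wastes context.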

Embedding Models: Choosing the Right One

The embedding model determines your retrieval quality ceiling. No amount of reranking can fix bad embeddings.

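| Model | Dimensions | Max Tokens | MTEB Score | Latency | Cost |
|---|---|---|---|---|---|
| text-embedding-3-large | 3072 | 8191 | 64.6 | ~50ms | $0.13/1M |
| text-embedding-3-small | 1536 | 8191 | 62.3 | ~30ms | $0.02/1M |
| voyage-3-large | 1024 | 32000 | 67.2 | ~80ms | $0.18/1M |
| BGE-M3 (self-hosted) | 1024 | 8192 | 66.1 | ~20ms | GPU cost only |
| Cohere embed-v4 | 1024 | 512 | 66.5 | ~40ms | $0.10/1M |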
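
The cost column matters most at ingestion time, when the whole corpus is embedded at once. A rough estimate follows, using per-million-token prices from the table and an assumed corpus of one million documents averaging 1,000 tokens each; the corpus size is an illustrative assumption.

```python
PRICE_PER_MILLION_TOKENS = {  # USD, taken from the comparison table
    "text-embedding-3-large": 0.13,
    "text-embedding-3-small": 0.02,
    "voyage-3-large": 0.18,
}


def embedding_cost_usd(model: str, num_docs: int, avg_tokens_per_doc: int) -> float:
    """Estimate the one-off cost of embedding a corpus with a hosted model."""
    total_tokens = num_docs * avg_tokens_per_doc
    return round(total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model], 2)


for name in PRICE_PER_MILLION_TOKENS:
    print(name, embedding_cost_usd(name, num_docs=1_000_000, avg_tokens_per_doc=1_000))
```

Under these assumptions the one-off bill is $130, $20, and $180 respectively, and switching models later means paying it again to re-embed the corpus.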

For enterprise deployments where you need data privacy, self-hosting BGE-M3 gives you excellent quality with no data leaving your infrastructure:

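```python
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)


def embed_texts(texts: list[str]) -> list[list[float]]:
    """Return one dense vector per input text."""
    output = model.encode(
        texts,
        batch_size=32,
        max_length=8192,
        return_dense=True,
        return_sparse=True,  # lexical weights land in output["lexical_weights"]
    )
    # dense_vecs is a NumPy array; convert it to match the declared return type.
    return output["dense_vecs"].tolist()
```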

Hybrid Search with Qdrant

Pure vector search misses keyword-specific queries. Pure keyword search misses semantic meaning. Hybrid search combines both:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="enterprise_docs",
    vectors_config={
        "dense": models.VectorParams(size=1024, distance=models.Distance.COSINE),
    },
    sparse_vectors_config={
        "sparse": models.SparseVectorParams(
            modifier=models.Modifier.IDF,
        )
    },
)


def hybrid_search(query: str, collection: str = "enterprise_docs", top_k: int = 20):
    """Combine dense and sparse search with Reciprocal Rank Fusion."""
    # embed_dense and embed_sparse are project helpers that are not shown here.
    dense_vector = embed_dense(query)
    sparse_vector = embed_sparse(query)

    results = client.query_points(
        collection_name=collection,
        prefetch=[
            models.Prefetch(query=dense_vector, using="dense", limit=top_k),
            models.Prefetch(query=sparse_vector, using="sparse", limit=top_k),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=top_k,
    )
    return results.points
```
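
Reciprocal Rank Fusion itself is simple enough to sketch in plain Python. Each document scores the sum of 1 / (k + rank) over the result lists it appears in. The constant k = 60 is a commonly used default and is an assumption here; this illustrates the idea and is not Qdrant's internal implementation.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several best-first ID lists into one list sorted by RRF score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


dense_ids = ["a", "b", "c"]   # hypothetical dense-search ranking
sparse_ids = ["c", "a", "d"]  # hypothetical sparse-search ranking
fused = reciprocal_rank_fusion([dense_ids, sparse_ids])
```

Documents that both retrievers agree on ("a" and "c") rise above those found by only one, and no score normalization between the dense and sparse scales is needed because only ranks are used.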

Qdrant Configuration for Production

Critical settings that affect performance and reliability:

```yaml
# qdrant-config.yaml
storage:
  storage_path: /data/qdrant
  optimizers:
    default_segment_number: 4
    indexing_threshold_kb: 20000  # thresholds in the config file are given in KB
    memmap_threshold_kb: 50000
  wal:
    wal_capacity_mb: 256
  performance:
    max_search_threads: 0  # auto-detect
  hnsw_index:
    m: 16
    ef_construct: 128
    full_scan_threshold_kb: 10000
```

Reranking: The Secret Weapon

Reranking takes your top-N retrieved results and re-scores them with a cross-encoder model that considers query-document interaction. This consistently improves retrieval quality by 15-25%.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", max_length=512)


def rerank_results(query: str, results: list[dict], top_k: int = 5) -> list[dict]:
    """Rerank retrieved results using a cross-encoder."""
    if not results:
        return []

    # Score every (query, document) pair jointly.
    pairs = [(query, r["text"]) for r in results]
    scores = reranker.predict(pairs)

    for result, score in zip(results, scores):
        result["rerank_score"] = float(score)

    ranked = sorted(results, key=lambda x: x["rerank_score"], reverse=True)
    return ranked[:top_k]
```
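
`rerank_results` expects dictionaries with a `text` key, while `hybrid_search` returns Qdrant point objects, so a small adapter is needed between the two. This sketch assumes the chunk text was stored in the payload under `"text"` (and the parent link under `"parent_id"`) at indexing time, which this guide does not show; the function name is hypothetical.

```python
def points_to_results(points) -> list[dict]:
    """Convert scored points (objects with .id, .score, .payload) into plain dicts."""
    results = []
    for point in points:
        payload = point.payload or {}
        if "text" not in payload:
            continue  # nothing for the cross-encoder to read
        results.append({
            "id": point.id,
            "text": payload["text"],
            "retrieval_score": point.score,
            "parent_id": payload.get("parent_id"),
        })
    return results
```

With that in place, the two stages chain as `rerank_results(query, points_to_results(hybrid_search(query)))`.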

RAG Evaluation: Measuring What Matters

You cannot improve what you do not measure. These are the metrics that matter for production RAG:

| Metric | What It Measures | Target | How to Compute |
|---|---|---|---|
| Faithfulness | Answer grounded in retrieved context | > 0.9 | LLM-as-judge: "Is this answer supported by the context?" |
| Answer Relevance | Answer addresses the question | > 0.85 | LLM-as-judge: "Does this answer the question?" |
| Context Precision | Retrieved docs are relevant | > 0.8 | Fraction of retrieved chunks that are relevant |
| Context Recall | All relevant docs retrieved | > 0.75 | Fraction of ground-truth docs that appear in retrieval |
| Hallucination Rate | Claims not in context | < 5% | LLM-as-judge with citation checking |

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

eval_dataset = Dataset.from_dict({
    "question": questions,
    "answer": generated_answers,
    "contexts": retrieved_contexts,
    "ground_truth": reference_answers,
})

results = evaluate(
    eval_dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

print(f"Faithfulness: {results['faithfulness']:.3f}")
print(f"Answer Relevancy: {results['answer_relevancy']:.3f}")
print(f"Context Precision: {results['context_precision']:.3f}")
print(f"Context Recall: {results['context_recall']:.3f}")
```
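
The two retrieval metrics in the table have simple set-based definitions that are useful as a sanity check alongside RAGAS. A minimal sketch, assuming you have hand-labeled relevant chunk IDs for each question; it follows the table's definitions literally and is not how RAGAS computes its LLM-judged versions. The function names are hypothetical.

```python
def simple_context_precision(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant_ids)
    return hits / len(retrieved_ids)


def simple_context_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of ground-truth chunks that appear in the retrieved set."""
    if not relevant_ids:
        return 0.0
    found = len(relevant_ids & set(retrieved_ids))
    return found / len(relevant_ids)
```

Because these need no LLM call, they are cheap enough to run on every retrieval change before spending judge tokens on the full evaluation.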

Production Deployment Patterns

Multi-Tenant Isolation

All tenants share a single Qdrant collection, and every query is restricted to the caller's documents with a payload filter on the tenant_id field:

```python
def search_tenant_docs(tenant_id: str, query: str, top_k: int = 10):
    """Search within a tenant's documents only."""
    response = client.query_points(
        collection_name="enterprise_docs",
        query=embed_dense(query),
        using="dense",  # the collection defines named vectors, so name the one to search
        query_filter=models.Filter(
            must=[models.FieldCondition(key="tenant_id", match=models.MatchValue(value=tenant_id))]
        ),
        limit=top_k,
    )
    return response.points
```

Caching for Cost Reduction

Semantic caching can reduce both latency and LLM costs by 40-60% on workloads where users repeat or rephrase the same questions:

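```python
from datetime import datetime, timezone
from uuid import uuid4

from qdrant_client import AsyncQdrantClient, models

# Awaiting calls requires the async client; the synchronous QdrantClient cannot be awaited.
async_client = AsyncQdrantClient(url="http://localhost:6333")


async def cached_rag_query(query: str, tenant_id: str):
    """Check semantic cache before running full RAG pipeline."""
    query_vector = embed(query)  # embed once, reuse for lookup and insert

    cache_hits = await async_client.query_points(
        collection_name="rag_cache",
        query=query_vector,
        query_filter=models.Filter(
            must=[models.FieldCondition(key="tenant_id", match=models.MatchValue(value=tenant_id))]
        ),
        score_threshold=0.95,
        limit=1,
    )
    if cache_hits.points:
        return cache_hits.points[0].payload["answer"]

    # Cache miss: run the full pipeline, then store the answer for next time.
    answer = await full_rag_pipeline(query, tenant_id)

    await async_client.upsert(
        collection_name="rag_cache",
        points=[
            models.PointStruct(
                id=str(uuid4()),
                vector=query_vector,
                payload={
                    "query": query,
                    "answer": answer,
                    "tenant_id": tenant_id,
                    "cached_at": datetime.now(timezone.utc).isoformat(),
                },
            )
        ],
    )
    return answer
```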
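
What `score_threshold=0.95` means is easiest to see without a database. The sketch below is an in-memory stand-in for the `rag_cache` collection, assuming that collection uses cosine distance: it returns a cached answer only when the cosine similarity between the new query vector and a stored one reaches the threshold. The class and its methods are hypothetical and for illustration only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemorySemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        # Each entry is (tenant_id, query_vector, answer).
        self.entries: list[tuple[str, list[float], str]] = []

    def get(self, tenant_id: str, query_vector: list[float]):
        """Return the best cached answer at or above the threshold, else None."""
        best_answer, best_score = None, self.threshold
        for entry_tenant, vector, answer in self.entries:
            if entry_tenant != tenant_id:
                continue  # never serve one tenant's answer to another
            score = cosine_similarity(query_vector, vector)
            if score >= best_score:
                best_answer, best_score = answer, score
        return best_answer

    def put(self, tenant_id: str, query_vector: list[float], answer: str) -> None:
        self.entries.append((tenant_id, list(query_vector), answer))
```

A threshold this high trades hit rate for safety: lowering it serves more queries from cache but raises the risk of returning an answer to a question that only looks similar.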

Conclusion

Building enterprise RAG is not about picking the right embedding model or vector database — it is about getting every layer right: parsing, chunking, embedding, retrieval, reranking, generation, and evaluation.

The patterns in this guide come from building RAG systems that process millions of documents for real enterprise clients. Start with the parent-child chunking strategy, add hybrid search, layer in reranking, and measure everything with RAGAS.

Need help building a production RAG pipeline? Check out our [RAG architecture services](/services) or [reach out directly](/contact) for a technical consultation. See our [case studies](/case-studies) for examples of RAG systems we have shipped.

Dilip Singh
Lead Software Architect · Hureka Technologies

14+ years building enterprise software and AI systems. Architecting multi-agent AI platforms, RAG pipelines, voice AI, and high-performance SaaS for global clients.