
RAG Systems in Production: Lessons Learned

Practical guidance on building retrieval-augmented generation systems that actually work — chunking strategies, vector DB selection, retrieval quality, and prompt engineering.

2025-04-05 · 11 min read
RAG · LLM · Vector DB · Architecture

Everyone's building RAG systems. Few are building them well. After deploying several production RAG pipelines across enterprise environments, I've collected a set of hard-won lessons about what actually matters — and what's just hype.

Chunking Is the Hardest Part

Chunking strategy has more impact on retrieval quality than your choice of embedding model or vector database. The goal is to create chunks that are semantically self-contained — a chunk should answer a question without requiring context from neighboring chunks.

Chunking strategies ranked by effectiveness:

  • Semantic chunking (split by topic shifts) — best quality, highest complexity
  • Recursive character splitting with overlap — good balance of quality and simplicity
  • Sentence-window chunking — embed sentences, retrieve surrounding context
  • Fixed-size chunking — simple but produces poor boundaries
```python
# Recursive splitting with overlap
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)

# Key insight: chunk_size should match your embedding model's
# sweet spot (usually 256-512 tokens). Note that with
# length_function=len, chunk_size is measured in characters,
# not tokens, so scale accordingly.
```

Test your chunking by manually reviewing 50 random chunks. If you can't understand what a chunk is about without seeing its neighbors, your chunks are too small or are splitting at the wrong boundaries.
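The sentence-window strategy from the list above can be sketched in a few lines: embed each sentence on its own for precise retrieval, but hand the model a window of neighboring sentences at generation time. The helper below and its naive regex splitter are my own illustration, not a library API:

```python
import re

def sentence_windows(text, window=1):
    """Split text into sentences; pair each with its surrounding window.

    Embed the 'sentence' field for retrieval; pass the 'window' field
    to the LLM as context.
    """
    # Naive sentence split on ., !, ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks = []
    for i, sent in enumerate(sentences):
        lo, hi = max(0, i - window), min(len(sentences), i + window + 1)
        chunks.append({"sentence": sent, "window": " ".join(sentences[lo:hi])})
    return chunks

chunks = sentence_windows("First point. Second point. Third point.", window=1)
# chunks[1]["sentence"] is "Second point.";
# chunks[1]["window"] contains all three sentences
```

In production you'd swap the regex for a proper sentence segmenter, but the retrieve-small, read-large pattern is the same.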

Retrieval Quality > Generation Quality

If you retrieve the wrong chunks, no amount of prompt engineering will save you. I spend 80% of optimization time on retrieval and 20% on generation. The most effective improvements I've found are hybrid search (combining dense and sparse retrieval) and re-ranking.

```python
# Hybrid search: dense + sparse retrieval
# (vector_db, bm25_index, and cross_encoder stand in for your stack)
dense_results = vector_db.similarity_search(query, k=20)
sparse_results = bm25_index.search(query, k=20)

# Reciprocal Rank Fusion
combined = reciprocal_rank_fusion(
    [dense_results, sparse_results],
    k=60,  # RRF constant
)

# Re-rank top candidates with cross-encoder
reranked = cross_encoder.rerank(query, combined[:20])
final_context = reranked[:5]
```
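The snippet above leans on a `reciprocal_rank_fusion` helper. A minimal version over ranked lists of document IDs is only a few lines — each document scores the sum of 1/(k + rank) across the lists it appears in, with k=60 being the conventional constant from the original RRF paper:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest combined score first
    return sorted(scores, key=scores.get, reverse=True)

dense = ["a", "b", "c"]
sparse = ["b", "c", "d"]
fused = reciprocal_rank_fusion([dense, sparse])
# "b" wins: it ranks near the top of both lists,
# beating "a", which appears in only one
```

This is why hybrid search works: a document that both retrievers like outranks one that only a single retriever loves.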

Vector Database Selection

The vector DB market is noisy. Here's my pragmatic take based on production experience.

What matters in practice:

  • Filtering + vector search combined — you almost always need metadata filters alongside similarity search
  • Operational simplicity — managed services beat self-hosted for most teams
  • Update performance — if your corpus changes frequently, write speed matters
  • Cost at scale — embedding storage costs grow linearly with corpus size
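To see why combined filtering matters, here's a deliberately naive brute-force sketch — pre-filter on metadata, then rank only the survivors by cosine similarity. A real vector DB does this natively and far faster; the function and document shape are my own illustration:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def filtered_search(query_vec, docs, filters, k=3):
    """Keep docs matching every metadata filter, then rank by similarity."""
    candidates = [d for d in docs
                  if all(d["meta"].get(key) == val for key, val in filters.items())]
    candidates.sort(key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return candidates[:k]

docs = [
    {"id": 1, "vec": [1.0, 0.0], "meta": {"team": "legal"}},
    {"id": 2, "vec": [0.9, 0.1], "meta": {"team": "eng"}},
    {"id": 3, "vec": [0.0, 1.0], "meta": {"team": "eng"}},
]
hits = filtered_search([1.0, 0.0], docs, {"team": "eng"}, k=1)
# returns doc 2: the most similar document that passes the filter,
# even though doc 1 is the global nearest neighbor
```

If your DB only supports post-filtering (filter after the top-k vector search), relevant documents can silently drop out of the result set — test this before committing.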

Prompt Engineering for RAG

The generation prompt is where you prevent hallucination. Three rules: explicitly instruct the model to use only the provided context, give it an explicit "I don't know" fallback, and structure the context with clear delimiters.

```python
system_prompt = """Answer the question using ONLY the provided context.
If the context doesn't contain enough information, say
"I don't have enough information to answer that."

Do not use prior knowledge. Do not speculate.

Context:
---
{context}
---

Question: {question}
Answer:"""
```
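Assembling the `{context}` slot deserves the same care: label each chunk with its source and separate chunks with delimiters so the model can tell where one document ends and the next begins. A hypothetical helper (names are mine, not from any library):

```python
def build_context(chunks):
    """Join retrieved chunks with numbered, source-labeled delimiters."""
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        blocks.append(f"[Source {i}: {chunk['source']}]\n{chunk['text']}")
    return "\n---\n".join(blocks)

context = build_context([
    {"source": "handbook.md", "text": "PTO accrues monthly."},
    {"source": "faq.md", "text": "Requests go through the HR portal."},
])
# context then fills the {context} slot in the system prompt above
```

The source labels also let you ask the model to cite which source it used, which makes faithfulness failures much easier to spot.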

Evaluation Framework

You can't improve what you can't measure. I evaluate RAG systems on three dimensions: retrieval precision (are the right chunks being found?), faithfulness (does the answer stick to the context?), and relevance (does the answer address the question?). Automated evaluation with LLM-as-judge works surprisingly well for faithfulness and relevance.

Build your evaluation set before building your pipeline. 50-100 question-answer pairs with known source documents will save you weeks of trial and error.
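Retrieval precision is the easiest of the three dimensions to automate — no judge model needed, just gold chunk IDs per question. A minimal precision@k over an eval set (the data shape here is illustrative):

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved chunks that are actually relevant."""
    top_k = retrieved_ids[:k]
    return sum(1 for cid in top_k if cid in relevant_ids) / k

# One eval example: questions paired with known-relevant chunk IDs
eval_set = [
    {"question": "What is the refund window?",
     "relevant": {"c12", "c13"},
     "retrieved": ["c12", "c40", "c13", "c07", "c99"]},
]
scores = [precision_at_k(ex["retrieved"], ex["relevant"], k=5) for ex in eval_set]
mean_precision = sum(scores) / len(scores)
# 0.4 for this single example: 2 of the top 5 chunks are relevant
```

Run this on every pipeline change; a drop in precision@k tells you the problem is in retrieval before you waste time tuning prompts.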

What I'd Do Differently

Start with the simplest possible pipeline (basic chunking, single embedding model, top-k retrieval) and add complexity only when evaluation metrics demand it. Most RAG failures I've debugged come from over-engineering the pipeline before understanding the data.