Notes on building production RAG systems

Jan 8, 2025 • Muhammad Zeeshan

Retrieval-augmented generation looks straightforward in a notebook: chunk documents, embed, retrieve, prompt, answer. Production is messier, especially when documents update daily and users expect citations they can trust.

1. Chunking is a product decision

Chunk size affects recall, latency, and citation quality. Legal and policy docs often need section-aware splits; chat logs need session boundaries. Test chunk strategies on real failure cases, not only on happy-path queries.
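
As a rough illustration of a section-aware split, here is a minimal Python sketch. It assumes headings sit on their own line and look like "1. Title" or "2.3. Title"; the Chunk type, the chunk_by_section name, and the max_chars default are hypothetical choices for this example, not a recommendation. Real legal or policy documents usually need a heading pattern tuned to their own layout.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    section: str  # heading the chunk came from, kept so citations can name it
    text: str


# Assumption: headings look like "1. Title" or "2.3. Title" on their own line.
HEADING = re.compile(r"^\d+(?:\.\d+)*\.[ \t]+\S.*$", re.MULTILINE)


def chunk_by_section(doc: str, max_chars: int = 800) -> list[Chunk]:
    """Split on headings first, then pack paragraphs up to max_chars."""
    matches = list(HEADING.finditer(doc))
    chunks: list[Chunk] = []
    # Note: text before the first heading is ignored in this sketch.
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(doc)
        heading = match.group(0).strip()
        body = doc[match.end():end].strip()
        buf = ""
        for para in re.split(r"\n\s*\n", body):
            para = para.strip()
            if not para:
                continue
            # Start a new chunk when adding this paragraph would exceed the cap.
            # A single paragraph longer than max_chars is emitted whole.
            if buf and len(buf) + len(para) + 2 > max_chars:
                chunks.append(Chunk(heading, buf))
                buf = para
            else:
                buf = f"{buf}\n\n{para}" if buf else para
        if buf:
            chunks.append(Chunk(heading, buf))
    return chunks
```

Because every chunk carries its heading, a citation can point at a named section rather than an arbitrary character offset. Chunks never cross a section boundary, which is the property that tends to matter for policy-style documents.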

2. Hybrid retrieval beats religion

Pure vector search misses exact identifiers: SKUs, error codes, policy numbers. Hybrid keyword + vector retrieval with reranking consistently outperforms either alone in the systems I’ve shipped.
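
One common way to combine a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The Python sketch below is an illustration under assumptions, not the setup described above: it covers only the fusion step (a reranker would run on the fused list afterwards), the document IDs are made up, and k=60 is simply a commonly used default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) per list it appears in, so items
    ranked highly by either retriever rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for a query that contains an exact SKU.
keyword_hits = ["sku-123", "faq-7", "policy-2"]
vector_hits = ["faq-7", "guide-1", "sku-123"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

RRF uses only rank positions, so it avoids having to normalize keyword scores against cosine similarities, which live on different scales.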

3. Evaluate retrieval and generation separately

If retrieval fails, no prompt engineering saves you. Build a golden set of questions with expected source passages. Track recall@k (the share of expected passages that appear in the top k results) before you tune the LLM.
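
A golden-set check can be quite small. The Python sketch below computes recall@k for each question and averages it. The golden-set layout (a "question" string plus a set of "expected" passage IDs) and the retrieve callable are assumptions made for the example, and the fake retriever stands in for a real search call.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], expected: set[str], k: int) -> float:
    """Fraction of expected passage IDs found in the top-k retrieved IDs."""
    if not expected:
        raise ValueError("each golden question needs at least one expected passage")
    hits = len(expected & set(retrieved[:k]))
    return hits / len(expected)


def mean_recall_at_k(
    golden: list[dict],
    retrieve: Callable[[str], list[str]],
    k: int,
) -> float:
    """Average recall@k over a golden set of questions."""
    scores = [
        recall_at_k(retrieve(item["question"]), item["expected"], k)
        for item in golden
    ]
    return sum(scores) / len(scores)


# Hypothetical golden set and a fake retriever standing in for the real one.
golden_set = [
    {"question": "q1", "expected": {"a", "b"}},
    {"question": "q2", "expected": {"c"}},
]
fake_results = {"q1": ["a", "x", "b"], "q2": ["y", "z", "c"]}


def fake_retrieve(question: str) -> list[str]:
    return fake_results[question]


recall_2 = mean_recall_at_k(golden_set, fake_retrieve, k=2)
recall_3 = mean_recall_at_k(golden_set, fake_retrieve, k=3)
```

Running this against every index or chunking change gives a retrieval number that moves independently of the prompt and the model.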

4. Citations are UX, not decoration

Users trust answers with clickable sources. Surface the passage, not just the filename. When retrieval returns nothing relevant, say so instead of answering with false confidence.
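
The abstain-or-cite decision can be sketched in Python as below. The Passage type, the min_score threshold of 0.5, and the message wording are all hypothetical. Score scales differ between retrievers and rerankers, so a real threshold would need to be calibrated against a golden set.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str   # filename or URL shown to the user
    text: str     # the passage itself, so it can be surfaced inline
    score: float  # relevance score; higher is better (scale is an assumption)


NO_ANSWER = "I couldn't find a relevant source for that question."


def select_citations(
    passages: list[Passage], min_score: float = 0.5, max_citations: int = 3
) -> list[Passage]:
    """Keep only passages above the threshold, best first."""
    relevant = [p for p in passages if p.score >= min_score]
    relevant.sort(key=lambda p: p.score, reverse=True)
    return relevant[:max_citations]


def render_answer(answer: str, passages: list[Passage]) -> str:
    """Attach quoted passages to the answer, or abstain if none qualify."""
    cited = select_citations(passages)
    if not cited:
        return NO_ANSWER
    lines = [answer, "", "Sources:"]
    for i, passage in enumerate(cited, start=1):
        lines.append(f'[{i}] {passage.source}: "{passage.text}"')
    return "\n".join(lines)


# Hypothetical retrieval results for a refund question.
results = [
    Passage("refund-policy.txt", "Refunds are issued within 14 days.", 0.82),
    Passage("faq.txt", "Contact support for help.", 0.31),
]
rendered = render_answer("Refunds take up to 14 days.", results)
```

In a real pipeline the select_citations check would typically run before the LLM call, so that an empty result skips generation altogether instead of discarding an answer after the fact.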

Summary

RAG in production is an indexing, evaluation, and UX problem as much as an LLM problem. Treat retrieval as first-class infrastructure and you’ll ship faster.