Notes on building production RAG systems

    Retrieval-augmented generation looks straightforward in a notebook: chunk documents, embed, retrieve, prompt, answer. Production is messier, especially when documents update daily and users expect citations they can trust.

    1. Chunking is a product decision

    Chunk size affects recall, latency, and citation quality. Legal and policy docs often need section-aware splits; chat logs need session boundaries. Test chunk strategies on real failure cases, not only on happy-path queries.
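
    As a rough illustration of a section-aware split, here is a minimal sketch. It assumes headings look like "1. Scope" or "2.3 Definitions"; that pattern, the function name split_by_section, and the max_chars default are all hypothetical choices, not a recommendation for any particular corpus.

```python
import re

# Assumed heading pattern: a line starting with "1.", "2.3", etc. followed by text.
# This is crude: a body line such as "3 people attended" would also match.
HEADING = re.compile(r"^\d+(?:\.\d+)*\.?\s+\S")


def split_by_section(text, max_chars=1200):
    """Split text at numbered headings, then cut long sections into pieces."""
    sections, current = [], []
    for line in text.splitlines():
        # A heading closes the section collected so far.
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())

    chunks = []
    for section in sections:
        if not section:
            continue
        heading = section.splitlines()[0]
        for start in range(0, len(section), max_chars):
            piece = section[start:start + max_chars]
            # Continuation pieces repeat the heading so a citation keeps its
            # section context; they can exceed max_chars by the heading length.
            chunks.append(piece if start == 0 else f"{heading}\n{piece}")
    return chunks


policy = "1. Scope\nThis policy applies to staff.\n2. Leave\nLeave accrues monthly."
chunks = split_by_section(policy)
```

    Chat logs would need a different boundary rule (session or speaker turns), which is the point: the splitter encodes a decision about the content, so it is worth testing against the queries that actually fail.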

    2. Hybrid retrieval beats religion

    Pure vector search misses exact identifiers: SKUs, error codes, policy numbers. Hybrid keyword + vector retrieval with reranking consistently outperforms either alone in the systems I’ve shipped.
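
    One simple way to combine the two result lists is reciprocal rank fusion (RRF). The sketch below is a fusion step only, not a reranker, and it is not necessarily what the systems mentioned above used. The document IDs are made up, and k=60 is just a commonly used default.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranked list.

    Each document scores 1 / (k + rank) in every list where it appears;
    the scores are summed and documents are sorted by the total.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Ties are broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for a query that contains an exact SKU.
keyword_hits = ["sku-4417", "faq-12", "policy-7"]
vector_hits = ["faq-12", "returns-guide", "sku-4417"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

    RRF needs only ranks, so it avoids having to normalize keyword scores against vector similarity scores. A reranking model, if used, would then run over the top of the fused list.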

    3. Evaluate retrieval and generation separately

    If retrieval fails, no prompt engineering saves you. Build a golden set of questions with expected source passages. Track recall@k before you tune the LLM.
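
    A minimal sketch of recall@k over a golden set follows. It assumes each golden entry pairs a question with the IDs of its expected source passages, and that retrieve is any callable returning ranked passage IDs; both names and the data shape are illustrative.

```python
def recall_at_k(retrieved_ids, expected_ids, k):
    """Fraction of expected passages that appear in the top k results."""
    expected = set(expected_ids)
    if not expected:
        raise ValueError("each golden question needs at least one expected passage")
    hits = expected & set(retrieved_ids[:k])
    return len(hits) / len(expected)


def mean_recall_at_k(golden_set, retrieve, k=5):
    """Average recall@k over (question, expected_ids) pairs."""
    scores = [
        recall_at_k(retrieve(question), expected_ids, k)
        for question, expected_ids in golden_set
    ]
    return sum(scores) / len(scores)


# Hypothetical golden set and a stand-in retriever backed by a dict.
golden_set = [("q1", ["a"]), ("q2", ["b", "c"])]
fake_results = {"q1": ["a", "x"], "q2": ["x", "b", "y"]}
score = mean_recall_at_k(golden_set, fake_results.get, k=2)
```

    Because the metric never calls the LLM, it is cheap enough to run on every index or chunking change.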

    4. Citations are UX, not decoration

    Users trust answers with clickable sources. Surface the passage, not just the filename. When retrieval returns nothing relevant, say so; don’t hallucinate confidence.
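
    One way to make "say so" concrete is to refuse before the LLM is ever called. The sketch below assumes each retrieved passage carries a relevance score where higher is better. The Passage type, the 0.5 threshold, and the return shape are all hypothetical and would need tuning against real data.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    doc_id: str
    text: str
    score: float  # assumption: higher means more relevant


def build_context(passages, min_score=0.5, max_passages=3):
    """Return (context, citations), or (None, []) when nothing is relevant."""
    relevant = sorted(
        (p for p in passages if p.score >= min_score),
        key=lambda p: p.score,
        reverse=True,
    )[:max_passages]
    if not relevant:
        # The caller should show a "no relevant source found" message
        # instead of prompting the model without evidence.
        return None, []
    # Number each passage so the answer can cite [1], [2], ...
    context = "\n\n".join(
        f"[{i}] {p.text}" for i, p in enumerate(relevant, start=1)
    )
    # Keep the passage text itself, not just the document ID, for the UI.
    citations = [
        {"marker": i, "doc_id": p.doc_id, "passage": p.text}
        for i, p in enumerate(relevant, start=1)
    ]
    return context, citations


context, citations = build_context(
    [Passage("returns.md", "Refunds take 5 days.", 0.9),
     Passage("faq.md", "Unrelated text.", 0.2)]
)
```

    Returning the passage text alongside the marker lets the interface show exactly what the answer relied on.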

    Summary

    RAG in production is an indexing, evaluation, and UX problem as much as an LLM problem. Treat retrieval as first-class infrastructure and you’ll ship faster.