Building LLM applications for production

    Shipping an LLM feature means more than calling chat.completions. Here’s a checklist I use before handing projects to clients.

    1. Instrument everything

    Log prompts, retrieved context hashes, latency, token usage, and user feedback. You cannot improve what you cannot see, and you cannot debug production incidents from vibes.
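
    A minimal sketch of one such log record per model call, using only the Python standard library. The names here (LLMCallRecord, instrumented_call, fake_model) and the shape of the model callable's return value are assumptions for illustration, not any provider's API; a real client would read token counts from the provider's response.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class LLMCallRecord:
    """One structured log line per model call (hypothetical schema)."""
    prompt: str
    context_hashes: List[str]
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    user_feedback: Optional[str] = None  # filled in later, if the user rates the answer


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def instrumented_call(
    call_model: Callable[[str, List[str]], Tuple[str, int, int]],
    prompt: str,
    context_chunks: List[str],
    sink: Callable[[str], None],
) -> Tuple[str, LLMCallRecord]:
    """Call the model, time it, and emit one JSON log line to `sink`."""
    start = time.perf_counter()
    text, prompt_tokens, completion_tokens = call_model(prompt, context_chunks)
    latency_ms = (time.perf_counter() - start) * 1000.0

    record = LLMCallRecord(
        prompt=prompt,
        # Hash the retrieved chunks so logs stay small but remain comparable.
        context_hashes=[sha256_hex(chunk) for chunk in context_chunks],
        latency_ms=latency_ms,
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
    )
    sink(json.dumps(asdict(record)))
    return text, record


def fake_model(prompt: str, context_chunks: List[str]) -> Tuple[str, int, int]:
    """Stand-in for a real provider call; token counts are made up."""
    return "stub answer", len(prompt.split()), 2


if __name__ == "__main__":
    answer, _ = instrumented_call(
        fake_model, "What is our refund policy?", ["chunk a", "chunk b"], print
    )
```

    Passing the sink as a parameter keeps the sketch testable; in production it would more likely be a logger or a metrics pipeline.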

    2. Version prompts like code

    Store prompts in git, review changes, and tag each release with the model version it was tested against. “It worked yesterday” usually means something upstream changed, such as the prompt, the model version, or the retrieved context.
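
    One way to make a prompt release traceable at runtime is to attach a content fingerprint and the pinned model name to every rendered prompt. The sketch below assumes a hypothetical PromptRelease class and a placeholder model name; it complements git history rather than replacing it.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptRelease:
    """A prompt template pinned to the model it was tested with (hypothetical)."""
    name: str
    template: str
    model: str  # placeholder identifier, e.g. "example-model-v1"

    @property
    def fingerprint(self) -> str:
        # Short content hash: changes whenever the template text changes.
        digest = hashlib.sha256(self.template.encode("utf-8")).hexdigest()
        return digest[:12]

    @property
    def tag(self) -> str:
        # Suitable for log lines and release tags.
        return f"{self.name}@{self.fingerprint}+{self.model}"

    def render(self, **variables: str) -> str:
        return self.template.format(**variables)


if __name__ == "__main__":
    release = PromptRelease(
        name="summarize",
        template="Summarize in one sentence: {text}",
        model="example-model-v1",
    )
    print(release.tag)
    print(release.render(text="hello"))
```

    Logging the tag next to each response makes it possible to tell which prompt and model pair produced a given output.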

    3. Guardrails at the boundary

    Validate inputs and outputs: PII filters, max lengths, refusal patterns, and structured output schemas where possible. Fail closed when confidence is low: block or escalate the response rather than passing it through.
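
    A minimal sketch of boundary checks along these lines, assuming the model is asked to return a JSON object with "answer" and "confidence" fields. The field names, the length limit, the threshold, and the email-only PII pattern are all illustrative assumptions; a real filter would cover far more than email addresses.

```python
import json
import re

MAX_INPUT_CHARS = 4000   # assumed limit
MIN_CONFIDENCE = 0.7     # assumed threshold
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class GuardrailError(ValueError):
    """Raised when input or output fails a boundary check."""


def check_input(user_text: str) -> str:
    """Reject oversized input and redact email addresses before the model sees it."""
    if len(user_text) > MAX_INPUT_CHARS:
        raise GuardrailError("input too long")
    return EMAIL_RE.sub("[REDACTED_EMAIL]", user_text)


def check_output(raw: str) -> dict:
    """Parse and validate model output; raise instead of passing bad data on."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise GuardrailError("output is not valid JSON") from exc
    if not isinstance(data, dict):
        raise GuardrailError("output is not a JSON object")
    if not isinstance(data.get("answer"), str):
        raise GuardrailError("field 'answer' missing or not a string")
    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise GuardrailError("field 'confidence' missing or not a number")
    if confidence < MIN_CONFIDENCE:
        # Fail closed: low confidence is treated as a failure.
        raise GuardrailError("confidence below threshold")
    return data


if __name__ == "__main__":
    print(check_input("mail me at jane.doe@example.com please"))
    print(check_output('{"answer": "42", "confidence": 0.9}'))
```

    Raising an exception on every failed check is what makes this fail closed: the caller has to handle the error explicitly before anything reaches the user.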

    4. Load-test the unhappy path

    Providers rate-limit. Context windows overflow. Tools time out. Design retries, fallbacks, and user-visible degradation, not silent failure.
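
    The retry, fallback, and degradation chain can be sketched as follows. TransientError, call_with_retries, and answer are hypothetical names; in a real client the provider's rate-limit and timeout exceptions would be mapped onto the retryable error type.

```python
import random
import time
from typing import Callable


class TransientError(Exception):
    """Hypothetical marker for retryable failures (rate limit, timeout)."""


def call_with_retries(
    fn: Callable[[], str],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry `fn` on TransientError with exponential backoff and jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")


def answer(primary: Callable[[], str], fallback: Callable[[], str], **retry_kwargs) -> dict:
    """Try the primary model, then a fallback, then a visible error message."""
    try:
        return {"text": call_with_retries(primary, **retry_kwargs), "degraded": False}
    except TransientError:
        pass
    try:
        return {"text": call_with_retries(fallback, **retry_kwargs), "degraded": True}
    except TransientError:
        # Degrade visibly instead of failing silently.
        return {
            "text": "The assistant is temporarily unavailable. Please try again shortly.",
            "degraded": True,
        }


if __name__ == "__main__":
    def always_fail() -> str:
        raise TransientError("rate limited")

    print(answer(always_fail, lambda: "fallback answer", attempts=2, base_delay=0.01))
```

    The "degraded" flag lets the UI tell the user that a fallback was used, and injecting `sleep` makes the unhappy path cheap to exercise in tests.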

    Summary

    Production LLM apps are software systems. Reliability comes from observability, versioning, and explicit failure modes, not from a bigger model alone.