Building LLM applications for production
Shipping an LLM feature means more than calling chat.completions. Here’s a checklist I use before handing projects to clients.
1. Instrument everything
Log prompts, retrieved context hashes, latency, token usage, and user feedback. You cannot improve what you cannot see, and you cannot debug production incidents from vibes.
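Here is a minimal sketch of what that logging could look like. It assumes a hypothetical `call_model` function that returns a dict with `text`, `prompt_tokens`, and `completion_tokens`; your provider's client will have a different shape, so treat the field names as placeholders rather than a real API.

```python
import hashlib
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, List, Optional


@dataclass
class LLMCallRecord:
    prompt: str
    context_hashes: List[str]
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    user_feedback: Optional[str] = None  # filled in later, once the user reacts


def hash_context(chunk: str) -> str:
    # Log a short hash instead of the raw retrieved text to keep records small.
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]


def instrumented_call(
    call_model: Callable[[str, List[str]], dict],  # hypothetical model wrapper
    prompt: str,
    context_chunks: List[str],
    sink: Callable[[str], None] = print,  # swap in your real log pipeline
):
    start = time.perf_counter()
    result = call_model(prompt, context_chunks)
    latency_ms = (time.perf_counter() - start) * 1000

    record = LLMCallRecord(
        prompt=prompt,
        context_hashes=[hash_context(c) for c in context_chunks],
        latency_ms=latency_ms,
        prompt_tokens=result["prompt_tokens"],
        completion_tokens=result["completion_tokens"],
    )
    sink(json.dumps(asdict(record)))  # one JSON line per call
    return result["text"], record
```

Emitting one JSON line per call keeps the records easy to grep during an incident and easy to load into whatever analytics tool you already use.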
2. Version prompts like code
Store prompts in git, review changes, and tag releases with model versions. “It worked yesterday” usually means something upstream changed.
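One way to make "something upstream changed" visible is to fingerprint each prompt together with the model it was tested against, and log that fingerprint with every call. The sketch below is illustrative only: `PromptRelease` and the model name are made up, and in a real project the template would be read from a git-tracked file rather than written inline.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptRelease:
    name: str
    template: str
    model: str  # pinned model identifier the prompt was reviewed against

    @property
    def fingerprint(self) -> str:
        # Changes whenever the template text or the pinned model changes.
        payload = f"{self.model}\n{self.template}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]

    def render(self, **variables: str) -> str:
        return self.template.format(**variables)


# Hypothetical release; the model name is a placeholder, not a real model.
SUMMARIZE = PromptRelease(
    name="summarize",
    template="Summarize the following text in one sentence:\n{text}",
    model="example-model-2024-01-01",
)

if __name__ == "__main__":
    print(SUMMARIZE.name, SUMMARIZE.fingerprint)
    print(SUMMARIZE.render(text="LLM apps need versioned prompts."))
```

If two incidents carry different fingerprints, you know to diff the prompt or the model pin before looking anywhere else.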
3. Guardrails at the boundary
Validate inputs and outputs: PII filters, max lengths, refusal patterns, and structured output schemas where possible. Fail closed when confidence is low.
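As a rough illustration, a boundary check can be a pair of small functions, one on the way in and one on the way out. Everything specific here is an assumption for the example: the length limit, the email-only PII pattern, the `answer`/`confidence` output schema, and the 0.7 threshold. A real deployment would likely use a proper PII detector and a schema library.

```python
import json
import re

MAX_INPUT_CHARS = 4000  # assumed limit; tune to your context budget
MIN_CONFIDENCE = 0.7    # assumed threshold for failing closed
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # deliberately simple
REQUIRED_FIELDS = {"answer": str, "confidence": float}
FALLBACK = {"answer": "Sorry, I can't answer that reliably.", "confidence": 0.0}


def check_input(user_text: str) -> str:
    """Reject oversized input and redact email addresses before the model sees them."""
    if len(user_text) > MAX_INPUT_CHARS:
        raise ValueError("input exceeds maximum length")
    return EMAIL_RE.sub("[REDACTED_EMAIL]", user_text)


def check_output(raw: str) -> dict:
    """Parse and validate model output; return FALLBACK on any doubt (fail closed)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return dict(FALLBACK)
    if not isinstance(data, dict):
        return dict(FALLBACK)
    for key, expected_type in REQUIRED_FIELDS.items():
        # A missing key or wrong type (including a bare integer confidence)
        # is rejected, which errs on the side of failing closed.
        if not isinstance(data.get(key), expected_type):
            return dict(FALLBACK)
    if data["confidence"] < MIN_CONFIDENCE:
        return dict(FALLBACK)
    return data
```

The point of the design is that every unexpected shape lands on the same safe fallback, so a malformed response never reaches the user as if it were a valid answer.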
4. Load-test the unhappy path
Providers rate-limit. Context windows overflow. Tools time out. Design retries, fallbacks, and user-visible degradation, not silent failure.
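A sketch of that pattern, under some assumptions: `ProviderError` is a made-up stand-in for whatever rate-limit or timeout exception your client raises, and `primary` and `fallback` are hypothetical callables that take a prompt and return text. The retry uses exponential backoff with jitter, then falls back, then degrades visibly.

```python
import random
import time
from typing import Callable


class ProviderError(Exception):
    """Hypothetical stand-in for a provider rate-limit or timeout error."""


UNAVAILABLE_MESSAGE = "The assistant is temporarily unavailable. Please try again shortly."


def call_with_fallback(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    prompt: str,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests
) -> dict:
    # Try the primary model, backing off exponentially between attempts.
    for attempt in range(max_attempts):
        try:
            return {"text": primary(prompt), "degraded": False}
        except ProviderError:
            if attempt < max_attempts - 1:
                delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
                sleep(delay)

    # Primary is exhausted: try the fallback once and flag the response.
    try:
        return {"text": fallback(prompt), "degraded": True}
    except ProviderError:
        # Last resort: tell the user plainly instead of failing silently.
        return {"text": UNAVAILABLE_MESSAGE, "degraded": True}
```

The `degraded` flag lets the UI show a banner or a lower-confidence label, and passing a no-op `sleep` makes the unhappy path cheap to exercise in a load test.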
Summary
Production LLM apps are software systems. Reliability comes from observability, versioning, and explicit failure modes, not from a bigger model alone.