Eval-first RAG
Teams ship retrieval before they ship evals: then wonder why quality regresses.
Golden datasets, offline regression, and human review loops are becoming table stakes. The best RAG projects treat eval infrastructure as part of v1, not a post-launch patch.
Agent graphs over chains
Linear chains are giving way to explicit state machines and graph orchestration.
LangGraph-style workflows, checkpointing, and human-in-the-loop steps map better to real business processes than one-shot prompt chains.
Right-sized models
Routing between small local models and frontier APIs is a cost and latency win.
Classification, extraction, and guardrails often run fine on smaller models; reserve frontier calls for reasoning-heavy steps.
LLM observability as product
Tracing token cost, latency, and failure modes is product analytics for AI apps.
Dashboards for drift, hallucination rate, and tool-call success are what separate demos from systems operators can trust.
Multimodal in the loop
Vision + document parsing is entering standard agent toolkits.
Invoices, screenshots, and PDFs flow through the same agent frameworks as text: extraction and verification steps matter more than the base model choice.
Open weights in production
Fine-tuned open models are viable for domain-specific pipelines with data sensitivity.
When privacy, cost, or latency dominates, self-hosted inference with vLLM/Ollama plus a thin API layer is a real architecture: not just a research exercise.