What is RAG (and why everyone's building it)
Retrieval-Augmented Generation (RAG) combines a large language model with an external knowledge base. Instead of baking all your company's data into an expensive fine-tuned model, RAG retrieves relevant documents at inference time and feeds them to the LLM as context. The result: an AI assistant that actually knows your business — your documentation, support history, product catalog, legal contracts.
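At its core the loop is just three steps: retrieve, augment, generate. Here's a minimal sketch of that flow; embed(), vector_store, and llm_complete() are placeholders for whatever embedding model, vector database, and LLM client you actually run, not any specific library's API.

```python
def answer(query: str, embed, vector_store, llm_complete, k: int = 5) -> str:
    # 1. Retrieve: embed the query and fetch the k most similar chunks.
    query_vec = embed(query)
    chunks = vector_store.search(query_vec, top_k=k)

    # 2. Augment: build a prompt that grounds the model in those chunks.
    context = "\n\n".join(chunk.text for chunk in chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

    # 3. Generate: the LLM answers grounded in the retrieved context.
    return llm_complete(prompt)
```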
In 2025, RAG pipeline development has become one of the most requested capabilities in enterprise AI. The appeal is obvious: connect Claude or GPT-4 to your internal knowledge and ship a working product in weeks. But there's a wide gap between a 20-minute demo and a system you'd bet a production SLA on.
Why most RAG demos fail in production
The demo works. You chunk some PDFs, embed them into Pinecone, and call the LLM. Answers look impressive. Then you push to production and things start breaking quietly:
- The retriever pulls the wrong chunks 20–40% of the time on real queries
- The model answers confidently from near-miss chunks, stitching together claims that are almost-but-not-quite what the docs say
- Latency spikes unpredictably under concurrent load
- Quality degrades over time and no one knows why
- A new engineer joins and can't debug anything because there's no observability
The root cause is almost always the same: the demo was built to impress, not to be debugged, monitored, or improved. RAG is deceptively easy to prototype, and that early ease hides how hard it is to productionize.
5 key decisions that determine production success
1. Chunking strategy
Fixed-size chunking (512 tokens) is fine for demos. For production, use semantic chunking or document-structure-aware splitting that respects headings, sections, and tables. The difference in retrieval precision can be 2× or more — and it compounds with every other component in your pipeline.
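For illustration, here's a rough sketch of structure-aware splitting for Markdown-style docs: split on headings first so chunks follow section boundaries, and fall back to fixed-size splitting only when a single section exceeds the budget. The heading regex and the words-per-token estimate are simplifying assumptions, not a library's behavior.

```python
import re

def split_by_structure(doc: str, max_tokens: int = 512) -> list[str]:
    # Split on Markdown headings so each chunk stays within one section.
    sections = re.split(r"\n(?=#{1,3} )", doc)
    max_words = int(max_tokens * 0.75)  # crude estimate: ~0.75 words per token

    chunks: list[str] = []
    for section in sections:
        words = section.split()
        if len(words) <= max_words:
            chunks.append(section.strip())
        else:
            # Only oversized sections get the fixed-size fallback.
            for i in range(0, len(words), max_words):
                chunks.append(" ".join(words[i:i + max_words]))
    return [c for c in chunks if c]
```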
2. Embedding model selection
OpenAI's text-embedding-3-large is strong, but not always the best fit for domain-specific content. Run evals on your actual data before committing. Consider BGE-M3 or E5-Mistral for cost efficiency at scale. A model that's 15% cheaper and 5% less accurate might be the right call if you're doing 10M embeddings per month.
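Running that eval doesn't require much machinery: embed a labeled set of (query, relevant document) pairs with each candidate model and compare recall@k. A sketch, where embed_batch() is assumed to be your own wrapper around each model's API:

```python
import numpy as np

def recall_at_k(query_vecs: np.ndarray, doc_vecs: np.ndarray,
                relevant_idx: list[int], k: int = 5) -> float:
    # Cosine similarity between every query and every document.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = q @ d.T

    hits = 0
    for i, rel in enumerate(relevant_idx):
        top_k = np.argsort(-sims[i])[:k]
        hits += int(rel in top_k)
    return hits / len(relevant_idx)

# Run once per candidate model and compare on *your* data, not a public leaderboard:
# score = recall_at_k(embed_batch(queries), embed_batch(docs), relevant_idx)
```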
3. Retrieval architecture
Naive similarity search misses too much. Production-grade retrieval uses: hybrid search (dense + sparse BM25 combined), reranking (Cohere Rerank or a cross-encoder), and HyDE (Hypothetical Document Embeddings) for complex or abstract queries. Each layer adds latency — measure the tradeoff on your actual query distribution.
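A common way to merge the dense and sparse result lists is reciprocal rank fusion (RRF), which only needs the ranked doc IDs from each retriever. A sketch, assuming your dense and BM25 retrievers each return an ordered list of IDs:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked doc-ID lists from multiple retrievers into one ranking."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results):
            # Standard RRF weighting: each retriever contributes 1 / (k + rank).
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([dense_ids, bm25_ids])
# Send the top of `fused` to the reranker, then the reranked top-n to the LLM.
```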
4. Context window management
More context isn't always better. Stuffing 8k tokens of retrieved text into every prompt degrades answer quality and inflates cost. Implement context compression (LLMLingua or similar) and relevance filtering before the LLM call. Set a hard token budget and enforce it.
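Enforcing the budget can be as simple as sorting chunks by relevance and cutting off once the running token count would exceed the cap. A sketch; the chunk shape and the word-count token estimate are assumptions (swap in a real tokenizer, e.g. tiktoken, in production):

```python
def fit_to_budget(chunks: list[dict], max_tokens: int = 3000) -> list[dict]:
    """Keep the most relevant chunks that fit inside a hard token budget.

    Each chunk is assumed to look like {"text": str, "score": float}.
    """
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        est_tokens = int(len(chunk["text"].split()) / 0.75)  # crude estimate
        if used + est_tokens > max_tokens:
            continue  # skip anything that would blow the budget
        selected.append(chunk)
        used += est_tokens
    return selected
```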
5. Guardrails and fallback behavior
What happens when no relevant documents are retrieved? Build explicit "I don't know" flows. Set retrieval confidence thresholds: if the top chunk's similarity score falls below your cutoff (0.72, say, depending on your embedding model), return the fallback instead of letting the model improvise. Log every query where the model hedges. These edge cases become your most valuable training data.
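In code, the guardrail is a check that runs before the LLM is ever called. A sketch; retrieve(), generate(), log(), and the 0.72 cutoff are placeholders for your own components and tuning:

```python
FALLBACK = "I couldn't find this in our documentation. Your question has been logged."

def guarded_answer(query: str, retrieve, generate, log,
                   min_score: float = 0.72) -> str:
    chunks = retrieve(query)

    # No hits, or the best hit is below the confidence threshold: refuse politely.
    if not chunks or chunks[0].score < min_score:
        log(query=query, outcome="no_confident_retrieval",
            top_score=chunks[0].score if chunks else None)
        return FALLBACK

    return generate(query, chunks)
```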
Stack: LangChain vs LlamaIndex vs custom
LangChain has the broadest integration surface and the largest community. It's the right choice for teams prototyping quickly or connecting to many different data sources. The abstraction overhead is real, though: debugging production issues through three layers of chains and callbacks is painful.
LlamaIndex is purpose-built for RAG and shines at complex document hierarchies, agentic retrieval, and knowledge graph construction. We reach for it on serious RAG projects where retrieval quality is the primary constraint.
Custom pipelines — built directly on LiteLLM, Qdrant or Weaviate, and your own orchestration layer — give the most control and the lowest runtime overhead. The right call for teams with the engineering bandwidth to own the stack and the query volumes that make efficiency matter.
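The custom route is less daunting than it sounds; the happy path is only a couple of calls. A rough sketch on Qdrant plus LiteLLM, where the collection name, the "text" payload field, the model string, and the embed() helper are all assumptions about your setup:

```python
from litellm import completion
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

def custom_rag(query: str, embed, collection: str = "docs", k: int = 5) -> str:
    # Dense retrieval straight from Qdrant; embed() is your embedding wrapper.
    hits = client.search(collection_name=collection,
                         query_vector=embed(query), limit=k)
    context = "\n\n".join(hit.payload["text"] for hit in hits)

    # One LLM call through LiteLLM, so the model stays swappable via config.
    response = completion(
        model="gpt-4o",  # any model string LiteLLM can route
        messages=[{
            "role": "user",
            "content": f"Context:\n{context}\n\nAnswer using only the context above.\n\nQuestion: {query}",
        }],
    )
    return response.choices[0].message.content
```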
Monitoring your RAG pipeline
A RAG system without observability is a liability you're accumulating silently. You need metrics at every layer:
- Retrieval quality: precision@k, recall@k, mean reciprocal rank — measured against a labeled eval set you build before launch (see the sketch after this list)
- Generation faithfulness: is the answer actually grounded in the retrieved documents? Tools like RAGAS automate this
- Latency percentiles: p50, p95, p99 across retrieval, reranking, and LLM calls — separately
- User feedback signals: explicit (thumbs, corrections) and implicit (re-asks, session abandonment)
- Cost per query: embedding calls, LLM tokens, reranker calls — broken down and trending
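For the retrieval-quality metrics, a small evaluator over your labeled set is enough to start with. The sketch below assumes each eval case maps a query to the set of doc IDs that should come back, and that retrieve(query, k) returns a ranked list of IDs:

```python
def retrieval_metrics(eval_set: dict[str, set[str]], retrieve, k: int = 5) -> dict:
    """Average precision@k, recall@k, and MRR over a labeled eval set."""
    precision, recall, rr = [], [], []
    for query, relevant in eval_set.items():
        retrieved = retrieve(query, k)
        hits = [doc_id for doc_id in retrieved if doc_id in relevant]

        precision.append(len(hits) / k)
        recall.append(len(hits) / len(relevant))
        # Reciprocal rank of the first relevant result (0 if none was retrieved).
        ranks = [i + 1 for i, doc_id in enumerate(retrieved) if doc_id in relevant]
        rr.append(1.0 / ranks[0] if ranks else 0.0)

    n = len(eval_set)
    return {
        f"precision@{k}": sum(precision) / n,
        f"recall@{k}": sum(recall) / n,
        "mrr": sum(rr) / n,
    }
```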
Tooling we use: Langfuse for tracing and prompt management, Arize Phoenix for embedding drift detection, RAGAS for automated eval. Log every query-response pair with full retrieved context — you'll need it when something breaks at 2am and you need to understand what the system actually saw.
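The logging piece doesn't have to wait for tooling: one structured record per request already gives you most of the 2am value. A sketch of the shape we mean, with illustrative field names and a plain JSONL file standing in for your trace store:

```python
import json
import time
import uuid

def log_rag_trace(query: str, chunks: list[dict], answer: str,
                  latencies_ms: dict, path: str = "rag_traces.jsonl") -> None:
    """Append one JSON line per request with everything the system actually saw."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "query": query,
        "retrieved": [{"id": c["id"], "score": c["score"], "text": c["text"]}
                      for c in chunks],
        "answer": answer,
        "latency_ms": latencies_ms,  # e.g. {"retrieval": 42, "rerank": 18, "llm": 900}
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```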
The teams that ship reliable RAG systems aren't necessarily using better models. They're the ones who built their eval suite before launch, who know exactly which query types fail, and who have a feedback loop that improves the system over time.