RAG (Retrieval-Augmented Generation)
A technique that retrieves relevant documents before answering, so the model grounds its output in real data instead of fabricating.
Basic
RAG pipelines have two steps. First, a retriever (often vector search over embeddings) finds relevant documents for the user's query. Second, the model generates an answer conditioned on those documents. RAG reduces hallucination and enables citing sources. It is the most common architecture for enterprise AI that needs to answer from proprietary data.
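The two steps can be sketched as a toy pipeline. The keyword-overlap retriever and `build_prompt` helper below are illustrative stand-ins, not real APIs: a production system would use embedding similarity for step one and an LLM call for step two.

```python
import re

# Toy two-step RAG sketch. Step 1: retrieve the documents most relevant
# to the query (crude word overlap stands in for vector search).
# Step 2: condition generation on the retrieved context (here we only
# build the prompt; a real system would send it to an LLM).

def words(text: str) -> set[str]:
    """Lowercase word set, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    q = words(query)
    return sorted(docs, key=lambda d: len(q & words(d)), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Ground the generator: answer only from retrieved sources."""
    sources = "\n".join(f"- {c}" for c in context)
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {query}"

docs = [
    "Our refund policy allows returns within 30 days.",
    "Shipping takes 5-7 business days.",
    "Support is available Monday through Friday.",
]
prompt = build_prompt("What is the refund policy?",
                      retrieve("What is the refund policy?", docs))
```

Swapping `retrieve` for a vector-DB query and piping `prompt` into any LLM turns this into the standard enterprise pattern.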
Deep
A RAG system has three components: an embedding model (e.g., text-embedding-3-large, Voyage, Cohere embed), a vector database (Pinecone, Weaviate, pgvector, Chroma), and a generation model (any LLM). Query → embedding → similarity search → top-k documents → prompt template with retrieved context → generation. Quality hinges on chunk size, chunk overlap, retrieval ranking, reranking (Cohere Rerank, Voyage Rerank), and prompt-template design. Advanced patterns include hybrid search (sparse + dense), HyDE (hypothetical document embeddings), query decomposition, and graph RAG. Costs are dominated by embedding (re-embedding every new document) and by inference (larger contexts from retrieved docs).
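The middle of that chain (embedding → similarity search → top-k) reduces to nearest-neighbor search by cosine similarity. A minimal exact-search sketch with hand-rolled vectors; a vector DB does the same at scale with approximate indexes:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec: list[float],
          doc_vecs: list[list[float]],
          k: int = 2) -> list[int]:
    """Indices of the k document vectors most similar to the query."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:k]

# Toy 3-d "embeddings"; real ones have hundreds to thousands of dimensions.
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
```

The returned indices map back to the original documents, which then fill the prompt template.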
Expert
RAG effectiveness is bounded by retrieval recall and generator grounding. Recall drops when chunking splits semantic units; fix via sentence-transformer-aware chunkers or structure-aware splitters. Ranking quality improves with cross-encoder rerankers but adds latency. Context window utilization matters: stuffing too many documents degrades performance due to "lost in the middle" attention patterns. Evaluation uses RAGAS or TruLens metrics: faithfulness, answer relevance, context precision, context recall. Graph RAG builds a knowledge graph from documents and traverses it for multi-hop queries. Self-RAG lets the model decide when to retrieve. Agentic RAG adds tool use for complex query decomposition.
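To give a flavor of what those metrics measure, here is a crude lexical proxy for context recall: the fraction of ground-truth statements covered by the retrieved context. RAGAS computes this with an LLM judge rather than word matching, so this sketch is only illustrative:

```python
def context_recall(ground_truth: list[str], context: str) -> float:
    """Fraction of ground-truth statements whose words all appear in
    the retrieved context. A lexical stand-in for the LLM-judged
    RAGAS metric of the same name."""
    ctx = set(context.lower().split())
    hits = sum(1 for s in ground_truth if set(s.lower().split()) <= ctx)
    return hits / len(ground_truth)

truth = ["returns allowed within 30 days", "shipping takes five days"]
context = "our policy returns allowed within 30 days of purchase"
```

Low context recall points at the retriever (or the chunker); low faithfulness points at the generator.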
Depending on why you're here
- Retrieve + generate pipeline with embedding model + vector DB + LLM
- Faithfulness, context precision, context recall are core metrics
- HyDE, hybrid search, rerankers push quality up · graph RAG for multi-hop
- Start with OpenAI embeddings + pgvector + Claude/GPT-4
- Chunk size 300-800 tokens with 10-20% overlap is a safe default
- Add a reranker if retrieval recall is low
- RAG is the dominant enterprise AI architecture · high adoption
- Vector DB market is consolidating · MongoDB Atlas Search, pgvector winning
- Embedding + rerank is commoditizing · pricing dropping fast
- AI that reads your docs before answering
- Why ChatGPT lets you upload files · RAG under the hood
- Reduces the "made-up answer" problem
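The chunking default in the list above (300-800 tokens, 10-20% overlap) amounts to a sliding window over the token stream. A minimal sketch, where tokens are just list items; a real pipeline would use the embedding model's own tokenizer:

```python
def chunk_with_overlap(tokens: list, size: int = 500,
                       overlap: float = 0.15) -> list[list]:
    """Split tokens into windows of `size` with fractional `overlap`
    between consecutive chunks, so content near a boundary still has
    surrounding context in at least one chunk."""
    step = max(1, int(size * (1 - overlap)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

tokens = list(range(1000))           # stand-in for a tokenized document
chunks = chunk_with_overlap(tokens, size=500, overlap=0.2)
```

With `size=500` and 20% overlap, consecutive chunks share 100 tokens; each chunk is then embedded and indexed separately.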
Often confused with
Fine-tuning changes model weights to teach new behavior. RAG keeps weights fixed and adds retrieved context at query time. The trade-offs differ: fine-tuning bakes knowledge into the weights (hard to update, no citations), while RAG keeps knowledge external (easy to update, citable).
RAG won the enterprise AI playbook. Every serious AI product ships a RAG pipeline before anything else.