RAG (Retrieval-Augmented Generation)
A technique that retrieves relevant documents before answering, so the model grounds its output in real data instead of fabricating.
Basic
RAG pipelines have two steps. First, a retriever (often vector search over embeddings) finds relevant documents for the user's query. Second, the model generates an answer conditioned on those documents. RAG reduces hallucination and enables citing sources. It is the most common architecture for enterprise AI that needs to answer from proprietary data.
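The two steps can be sketched as a toy pipeline. The keyword-overlap retriever and `build_prompt` helper below are illustrative stand-ins, not real APIs: a production system would use embedding similarity for step one and an LLM call for step two.

```python
import re

# Toy two-step RAG sketch. Step 1: retrieve the documents most relevant
# to the query (crude word overlap stands in for vector search).
# Step 2: condition generation on the retrieved context (here we only
# build the prompt; a real system would send it to an LLM).

def words(text: str) -> set[str]:
    """Lowercase word set, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents sharing the most words with the query."""
    q = words(query)
    return sorted(docs, key=lambda d: len(q & words(d)), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Ground the generator: answer only from retrieved sources."""
    sources = "\n".join(f"- {c}" for c in context)
    return f"Answer using only these sources:\n{sources}\n\nQuestion: {query}"

docs = [
    "Our refund policy allows returns within 30 days.",
    "Shipping takes 5-7 business days.",
    "Support is available Monday through Friday.",
]
prompt = build_prompt("What is the refund policy?",
                      retrieve("What is the refund policy?", docs))
```

Swapping `retrieve` for a vector-DB query and piping `prompt` into any LLM turns this into the standard enterprise pattern.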
Deep
A RAG system has three components: an embedding model (e.g., text-embedding-3-large, Voyage, Cohere embed), a vector database (Pinecone, Weaviate, pgvector, Chroma), and a generation model (any LLM). Query → embedding → similarity search → top-k documents → prompt template with retrieved context → generation. Quality hinges on chunk size, chunk overlap, retrieval ranking, reranking (Cohere Rerank, Voyage Rerank), and prompt-template design. Advanced patterns include hybrid search (sparse + dense), HyDE (hypothetical document embeddings), query decomposition, and graph RAG. Costs are dominated by embedding (re-embedding every new document) and by inference (larger contexts from retrieved docs).
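The middle of that chain (embedding → similarity search → top-k) reduces to nearest-neighbor search by cosine similarity. A minimal exact-search sketch with hand-rolled vectors; a vector DB does the same at scale with approximate indexes:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec: list[float],
          doc_vecs: list[list[float]],
          k: int = 2) -> list[int]:
    """Indices of the k document vectors most similar to the query."""
    order = sorted(range(len(doc_vecs)),
                   key=lambda i: cosine(query_vec, doc_vecs[i]),
                   reverse=True)
    return order[:k]

# Toy 3-d "embeddings"; real ones have hundreds to thousands of dimensions.
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.9, 0.1, 0.0]]
```

The returned indices map back to the original documents, which then fill the prompt template.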
Expert
RAG effectiveness is bounded by retrieval recall and generator grounding. Recall drops when chunking splits semantic units; fix via sentence-transformer-aware chunkers or structure-aware splitters. Ranking quality improves with cross-encoder rerankers but adds latency. Context window utilization matters: stuffing too many documents degrades performance due to "lost in the middle" attention patterns. Evaluation uses RAGAS or TruLens metrics: faithfulness, answer relevance, context precision, context recall. Graph RAG builds a knowledge graph from documents and traverses it for multi-hop queries. Self-RAG lets the model decide when to retrieve. Agentic RAG adds tool use for complex query decomposition.
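To give a flavor of what those metrics measure, here is a crude lexical proxy for context recall: the fraction of ground-truth statements covered by the retrieved context. RAGAS computes this with an LLM judge rather than word matching, so this sketch is only illustrative:

```python
def context_recall(ground_truth: list[str], context: str) -> float:
    """Fraction of ground-truth statements whose words all appear in
    the retrieved context. A lexical stand-in for the LLM-judged
    RAGAS metric of the same name."""
    ctx = set(context.lower().split())
    hits = sum(1 for s in ground_truth if set(s.lower().split()) <= ctx)
    return hits / len(ground_truth)

truth = ["returns allowed within 30 days", "shipping takes five days"]
context = "our policy returns allowed within 30 days of purchase"
```

Low context recall points at the retriever (or the chunker); low faithfulness points at the generator.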
Depending on why you're here
- Retrieve + generate pipeline with embedding model + vector DB + LLM
- Faithfulness, context precision, context recall are core metrics
- HyDE, hybrid search, rerankers push quality up · graph RAG for multi-hop
- Start with OpenAI embeddings + pgvector + Claude/GPT-4
- Chunk size 300-800 tokens with 10-20% overlap is a safe default
- Add a reranker if retrieval recall is low
- RAG is the dominant enterprise AI architecture · high adoption
- Vector DB market is consolidating · MongoDB Atlas Search, pgvector winning
- Embedding + rerank is commoditizing · pricing dropping fast
- AI that reads your docs before answering
- Why ChatGPT lets you upload files · RAG under the hood
- Reduces the "made-up answer" problem
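The chunking default in the list above (300-800 tokens, 10-20% overlap) amounts to a sliding window over the token stream. A minimal sketch, where tokens are just list items; a real pipeline would use the embedding model's own tokenizer:

```python
def chunk_with_overlap(tokens: list, size: int = 500,
                       overlap: float = 0.15) -> list[list]:
    """Split tokens into windows of `size` with fractional `overlap`
    between consecutive chunks, so content near a boundary still has
    surrounding context in at least one chunk."""
    step = max(1, int(size * (1 - overlap)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

tokens = list(range(1000))           # stand-in for a tokenized document
chunks = chunk_with_overlap(tokens, size=500, overlap=0.2)
```

With `size=500` and 20% overlap, consecutive chunks share 100 tokens; each chunk is then embedded and indexed separately.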
Often confused with
Fine-tuning changes model weights to teach new behavior. RAG keeps weights fixed and adds retrieved context at query time. The trade-offs differ: fine-tuning bakes knowledge into the weights (hard to update, no citations), while RAG keeps knowledge external (easy to update, citable).
RAG won the enterprise AI playbook. Every serious AI product ships a RAG pipeline before anything else.