RAG stack
The complete Retrieval-Augmented Generation stack. LLM, embedding model, vector database, and monthly cost across three tiers.
Tiers: 3
Type: Stack recipe
Updated: 2026-04
What this page is
RAG pipelines have four moving parts: embedding model, vector database, retrieval logic, and LLM. Each can range from premium to free. Most of the cost lives in the LLM calls: embeddings are cheap, and the vector DB is usually a flat monthly fee. Our cost estimates assume 1M queries per month, each with ~4K tokens of retrieved context and a ~500-token response.
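The cost split above is easy to verify with back-of-the-envelope arithmetic. The sketch below uses the page's workload assumption (1M queries/month, ~4K input and ~500 output tokens per query); the per-million-token prices are placeholders, not any provider's real rates.

```python
# Rough monthly-cost model for a RAG stack under the page's workload
# assumption: 1M queries/month, ~4K input tokens (retrieved context +
# question) and ~500 output tokens per query. Prices are placeholders.

def monthly_llm_cost(queries, in_tokens, out_tokens,
                     price_in_per_mtok, price_out_per_mtok):
    """Total LLM spend per month in dollars."""
    input_cost = queries * in_tokens / 1e6 * price_in_per_mtok
    output_cost = queries * out_tokens / 1e6 * price_out_per_mtok
    return input_cost + output_cost

def monthly_embedding_cost(queries, query_tokens, price_per_mtok):
    """Query-time embedding spend (corpus indexing is a one-off)."""
    return queries * query_tokens / 1e6 * price_per_mtok

# Hypothetical model at $20/M input and $100/M output tokens:
llm = monthly_llm_cost(1_000_000, 4_000, 500, 20.0, 100.0)
emb = monthly_embedding_cost(1_000_000, 50, 0.02)
print(f"LLM ~${llm:,.0f}/mo, embeddings ~${emb:,.0f}/mo")
# prints: LLM ~$130,000/mo, embeddings ~$1/mo
```

Even with made-up prices, the shape holds: LLM calls dominate by several orders of magnitude, which is why the tiers below differ mostly in their choice of LLM.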
Tier-by-tier breakdown
Frontier, mainstream, and budget recipes. Pick the row that matches your workload.
Frontier
Premium · max answer quality
Provider
Anthropic direct · Tool · Agent
Pinecone + OpenAI embeddings
Fully managed vector DB · text-embedding-3-large
Estimate · 1M queries · 4K context
~$100K/mo
For legal research, medical knowledge bases, and any RAG where a wrong answer is expensive. Opus grounds tightly in retrieved context and resists confabulation. Pinecone scales without ops. Expensive but defensible.
Mainstream
Mainstream · best value
Provider
OpenAI · Tool · Agent
Weaviate + text-embedding-3-small
Managed or self-hosted vector DB · small embeddings
Estimate · 1M queries · 4K context
~$2,000/mo
The default production RAG stack. GPT-5 mini is strong enough for 95% of questions, prompt caching drops cost further, and Weaviate offers both managed cloud and self-hosting. Embedding is nearly free at OpenAI prices.
Budget
Budget · open source
Provider
DeepInfra · Tool · Agent
Qdrant + BGE-M3 embeddings
Self-hosted vector DB + OSS embeddings
Estimate · 1M queries · 4K context
~$1,500/mo
The zero-lock-in stack: everything open source or self-hosted. DeepSeek on DeepInfra, Qdrant on a single 8-core VM, and BGE-M3 embeddings via a cheap GPU or HF Inference. Add 20-30% over-provisioning for traffic peaks.
Alternative picks
If the defaults do not fit, try these.
Alternative
Sometimes the cheapest path is dumping everything into a 1M window. Works for < 800K-token corpora.
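Before choosing the long-context path, it is worth sanity-checking whether the corpus actually fits. A rough sketch, using the common chars/4 heuristic for English text (the model's real tokenizer gives a precise count):

```python
# Quick check for the long-context alternative: does the whole corpus
# fit under a token budget? chars/4 is a rough English-text heuristic;
# use the model's actual tokenizer for a precise count.

def estimate_tokens(texts, chars_per_token=4):
    return sum(len(t) for t in texts) // chars_per_token

corpus = ["Example document text " * 1_000] * 500  # toy stand-in corpus
budget = 800_000                                   # headroom under 1M
fits = estimate_tokens(corpus) <= budget
```

If `fits` is false, retrieval is back on the table regardless of how cheap the long-context model is per call.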
Alternative
Cohere Command R+ + Cohere embeddings
Cohere built Command R+ specifically for RAG with citation grounding. Tight integration if you pay for it.
Alternative
Claude Haiku + in-memory FAISS
Fastest cheap RAG. For a small corpus (< 100K docs), skip the managed DB entirely.
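For corpora that small, the retrieval step reduces to brute-force similarity search. FAISS does this fast in C++; the pure-Python sketch below shows the same shape of computation with toy vectors, where a real pipeline would substitute embeddings from your model of choice.

```python
import math

# Brute-force in-memory retrieval for a small corpus: no managed DB,
# no index, just cosine similarity over every stored vector. FAISS
# (e.g. IndexFlatIP on normalized vectors) does the same thing faster.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=3):
    """Return indices of the k most similar document vectors."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:k]

# Toy 2-D vectors standing in for real embeddings:
docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
print(top_k([1.0, 0.1], docs, k=2))  # prints: [0, 1]
```

Linear scan is O(n) per query, which is perfectly fine at this scale; the managed-DB tiers above only start paying for themselves once corpus size or query concurrency outgrows a single process.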
Frequently asked questions
Do I need a managed vector database?
For corpora under 10K docs, no · in-memory FAISS or SQLite is fine. For larger corpora or concurrent queries, yes. Pinecone, Weaviate, Qdrant, and Milvus are all solid.