
RAG stack

The complete Retrieval-Augmented Generation stack. LLM, embedding model, vector database, and monthly cost across three tiers.

Tiers: 3
Type: Stack recipe
Updated: 2026-04
What this page is
RAG pipelines have four moving parts: embedding model, vector database, retrieval logic, and LLM. Each can range from premium to free. Most of the cost lives in the LLM calls · embeddings are cheap and the vector DB is usually a flat fee. Our cost estimates assume 1M queries per month, each with ~4K tokens of retrieved context and a ~500-token response.
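The tier estimates below follow directly from this model. A minimal sketch of the arithmetic, using the page's assumptions (1M queries/month, 4K input and 500 output tokens per query, embedding and vector-DB costs treated as negligible):

```python
# Back-of-envelope RAG cost model behind the per-tier estimates.
# LLM spend dominates; embeddings and the vector DB are ignored here.

def monthly_llm_cost(queries, in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Monthly LLM spend in dollars. Prices are $/M tokens."""
    per_query = (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1e6
    return queries * per_query

# The three tiers from the table:
frontier = monthly_llm_cost(1_000_000, 4_000, 500, 15.00, 75.00)  # ~$97,500 -> "~$100K/mo"
mainstream = monthly_llm_cost(1_000_000, 4_000, 500, 0.25, 2.00)  # $2,000/mo
budget = monthly_llm_cost(1_000_000, 4_000, 500, 0.28, 0.84)      # ~$1,540 -> "~$1,500/mo"
```

Note how lopsided the per-query split is at frontier prices: 4K input tokens cost $0.06 while the 500 output tokens cost $0.0375, so trimming retrieved context is the biggest single lever.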

Frontier, mainstream, and budget recipes. Pick the row that matches your workload.

Frontier
Premium · max answer quality
Model
Claude Opus 4
in $15/M · out $75/M
Tool · Agent
Pinecone + OpenAI embeddings
Fully managed vector DB · text-embedding-3-large
Estimate · 1M queries · 4K context
~$100K/mo
For legal research, medical knowledge bases, and any RAG where a wrong answer is expensive. Opus grounds tightly in retrieved context and resists confabulation. Pinecone scales without ops. Expensive but defensible.
Mainstream
Mainstream · best value
Model
GPT-5 mini
in $0.25/M · out $2/M
Provider
OpenAI
Tool · Agent
Weaviate + text-embedding-3-small
Managed or self-hosted vector DB · small embeddings
Estimate · 1M queries · 4K context
~$2,000/mo
The default RAG production stack. GPT-5 mini is strong enough for 95% of questions, prompt caching drops cost further, and Weaviate offers both managed cloud and self-hosting. Embeddings are nearly free at OpenAI prices.
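To put a number on the caching claim: a rough sketch, assuming a fixed ~1K-token system prompt that hits the cache on nearly every query and a cached-input rate of roughly a tenth of the normal input price (both assumptions · check current OpenAI pricing for the exact discount):

```python
# Effect of prompt caching on the mainstream tier's $2,000/mo estimate.
# Assumption: 1K of the 4K input tokens are a cacheable fixed prefix,
# billed at ~10% of the normal input rate.

IN_PRICE, OUT_PRICE = 0.25, 2.00      # $/M tokens, GPT-5 mini
CACHED_PRICE = IN_PRICE * 0.10        # assumed cached-input rate
QUERIES, CTX, PREFIX, OUT = 1_000_000, 3_000, 1_000, 500

without_cache = QUERIES * ((CTX + PREFIX) * IN_PRICE + OUT * OUT_PRICE) / 1e6
with_cache = QUERIES * (CTX * IN_PRICE + PREFIX * CACHED_PRICE + OUT * OUT_PRICE) / 1e6
# without_cache: $2,000/mo · with_cache: $1,775/mo
```

The saving scales with how much of the prompt is a stable prefix, so the win is larger for RAG setups with long fixed instructions and smaller for ones where retrieved chunks dominate the prompt.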
Budget
Budget · open source
Model
DeepSeek V3.2
in $0.28/M · out $0.84/M
Provider
DeepInfra
Tool · Agent
Qdrant + BGE-M3 embeddings
Self-hosted vector DB + OSS embeddings
Estimate · 1M queries · 4K context
~$1,500/mo
The zero-lock-in stack. Everything open-source or self-hosted. DeepSeek on DeepInfra, Qdrant on a single 8-core VM, BGE-M3 embeddings via a cheap GPU or HF Inference. Add 20 to 30 percent over-provisioning for peaks.

If the defaults do not fit, try these.

Alternative

Sometimes the cheapest path is skipping retrieval entirely and dumping the whole corpus into a 1M-token context window. Works for corpora under ~800K tokens.

Alternative
Cohere Command R+ + Cohere embeddings

Cohere built Command R+ specifically for RAG with citation grounding. Tight integration if you pay for it.

Alternative
Claude Haiku + in-memory FAISS

Fastest cheap RAG. For a small corpus (< 100K docs), skip the managed DB entirely.
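What "skip the managed DB" means in practice: brute-force cosine similarity over the corpus embeddings in memory. A minimal sketch in plain NumPy (a flat FAISS index does the same scan, just faster; below ~100K docs either is usually fine):

```python
# In-memory retrieval for a small corpus: exact nearest-neighbor search
# by cosine similarity. No index structure, no server, no ops.
import numpy as np

def build_index(embeddings):
    """L2-normalize rows so a dot product equals cosine similarity."""
    e = np.asarray(embeddings, dtype=np.float32)
    return e / np.linalg.norm(e, axis=1, keepdims=True)

def search(index, query_vec, k=5):
    """Return (doc indices, similarity scores) for the top-k matches."""
    q = np.asarray(query_vec, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]
```

Swap in real embeddings (e.g. from the embedding model of whichever tier you picked) and this is the entire retrieval layer.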

Do you need a managed vector database? For corpora under 10K docs, no · in-memory FAISS or SQLite is fine. For larger corpora or concurrent queries, yes. Pinecone, Weaviate, Qdrant, and Milvus are all solid.
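The SQLite option is less standard than FAISS, so a sketch of what it looks like: embeddings stored as float32 blobs in a table, scanned brute-force at query time (the table name and toy 2-d vectors are illustrative, not from this page):

```python
# "SQLite as a vector store" for tiny corpora: store embeddings as blobs,
# rank by cosine similarity in Python. Fine for a few thousand docs.
import math
import sqlite3
import struct

def to_blob(vec):
    return struct.pack(f"{len(vec)}f", *vec)

def from_blob(blob):
    return struct.unpack(f"{len(blob) // 4}f", blob)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, emb BLOB)")
for text, emb in [("alpha", [1.0, 0.0]), ("beta", [0.0, 1.0]), ("gamma", [0.7, 0.7])]:
    db.execute("INSERT INTO docs (text, emb) VALUES (?, ?)", (text, to_blob(emb)))

query = [0.9, 0.1]
rows = db.execute("SELECT text, emb FROM docs").fetchall()
ranked = sorted(rows, key=lambda r: cosine(query, from_blob(r[1])), reverse=True)
# ranked[0][0] == "alpha"
```

You trade query speed for zero infrastructure · the whole store is one file you can back up with `cp`.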