
RAG stack

The complete Retrieval-Augmented Generation stack. LLM, embedding model, vector database, and monthly cost across three tiers.

Tiers: 3
Type: Stack recipe
Updated: 2026-04
What this page is
RAG pipelines have four moving parts: embedding model, vector database, retrieval logic, and LLM. Each can range from premium to free. Most of the cost lives in the LLM calls · embeddings are cheap and the vector DB is usually a flat fee. Our cost estimates assume 1M queries per month, each with ~4K tokens of retrieved context and a ~500-token response.
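The tier estimates below follow directly from this model. A minimal sketch of the arithmetic, using the page's assumptions (1M queries/month, 4K input and 500 output tokens per query, embedding and vector-DB costs treated as negligible):

```python
# Back-of-envelope RAG cost model behind the per-tier estimates.
# LLM spend dominates; embeddings and the vector DB are ignored here.

def monthly_llm_cost(queries, in_tokens, out_tokens, in_price_per_m, out_price_per_m):
    """Monthly LLM spend in dollars. Prices are $/M tokens."""
    per_query = (in_tokens * in_price_per_m + out_tokens * out_price_per_m) / 1e6
    return queries * per_query

# The three tiers from the table:
frontier = monthly_llm_cost(1_000_000, 4_000, 500, 15.00, 75.00)  # ~$97,500 -> "~$100K/mo"
mainstream = monthly_llm_cost(1_000_000, 4_000, 500, 0.25, 2.00)  # $2,000/mo
budget = monthly_llm_cost(1_000_000, 4_000, 500, 0.28, 0.84)      # ~$1,540 -> "~$1,500/mo"
```

Note how lopsided the per-query split is at frontier prices: 4K input tokens cost $0.06 while the 500 output tokens cost $0.0375, so trimming retrieved context is the biggest single lever.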

Frontier, mainstream, and budget recipes. Pick the row that matches your workload.

Frontier
Premium · max answer quality
Model
Claude Opus 4
in $15/M · out $75/M
Tool · Agent
Pinecone + OpenAI embeddings
Fully managed vector DB · text-embedding-3-large
Estimate · 1M queries · 4K context
~$100K/mo
For legal research, medical knowledge bases, and any RAG where a wrong answer is expensive. Opus grounds tightly in retrieved context and resists confabulation. Pinecone scales without ops. Expensive but defensible.
Mainstream
Mainstream · best value
Model
GPT-5 mini
in $0.25/M · out $2/M
Provider
OpenAI
Tool · Agent
Weaviate + text-embedding-3-small
Managed or self-hosted vector DB · small embeddings
Estimate · 1M queries · 4K context
~$2,000/mo
The default RAG production stack. GPT-5 mini is strong enough for 95% of questions, prompt caching drops cost further, and Weaviate offers both managed cloud and self-hosting. Embeddings are nearly free at OpenAI prices.
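To put a number on the caching claim: a rough sketch, assuming a fixed ~1K-token system prompt that hits the cache on nearly every query and a cached-input rate of roughly a tenth of the normal input price (both assumptions · check current OpenAI pricing for the exact discount):

```python
# Effect of prompt caching on the mainstream tier's $2,000/mo estimate.
# Assumption: 1K of the 4K input tokens are a cacheable fixed prefix,
# billed at ~10% of the normal input rate.

IN_PRICE, OUT_PRICE = 0.25, 2.00      # $/M tokens, GPT-5 mini
CACHED_PRICE = IN_PRICE * 0.10        # assumed cached-input rate
QUERIES, CTX, PREFIX, OUT = 1_000_000, 3_000, 1_000, 500

without_cache = QUERIES * ((CTX + PREFIX) * IN_PRICE + OUT * OUT_PRICE) / 1e6
with_cache = QUERIES * (CTX * IN_PRICE + PREFIX * CACHED_PRICE + OUT * OUT_PRICE) / 1e6
# without_cache: $2,000/mo · with_cache: $1,775/mo
```

The saving scales with how much of the prompt is a stable prefix, so the win is larger for RAG setups with long fixed instructions and smaller for ones where retrieved chunks dominate the prompt.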
Budget
Budget · open source
Model
DeepSeek V3.2
in $0.28/M · out $0.84/M
Provider
DeepInfra
Tool · Agent
Qdrant + BGE-M3 embeddings
Self-hosted vector DB + OSS embeddings
Estimate · 1M queries · 4K context
~$1,500/mo
The zero-lock-in stack. Everything open-source or self-hosted. DeepSeek on DeepInfra, Qdrant on a single 8-core VM, BGE-M3 embeddings via a cheap GPU or HF Inference. Add 20 to 30 percent over-provisioning for peaks.

If the defaults do not fit, try these.

Alternative

Sometimes the cheapest path is skipping retrieval entirely and dumping the whole corpus into a 1M-token context window. Works for corpora under ~800K tokens.

Alternative
Cohere Command R+ + Cohere embeddings

Cohere built Command R+ specifically for RAG with citation grounding. Tight integration if you pay for it.

Alternative
Claude Haiku + in-memory FAISS

Fastest cheap RAG. For a small corpus (< 100K docs), skip the managed DB entirely.
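What "skip the managed DB" means in practice: brute-force cosine similarity over the corpus embeddings in memory. A minimal sketch in plain NumPy (a flat FAISS index does the same scan, just faster; below ~100K docs either is usually fine):

```python
# In-memory retrieval for a small corpus: exact nearest-neighbor search
# by cosine similarity. No index structure, no server, no ops.
import numpy as np

def build_index(embeddings):
    """L2-normalize rows so a dot product equals cosine similarity."""
    e = np.asarray(embeddings, dtype=np.float32)
    return e / np.linalg.norm(e, axis=1, keepdims=True)

def search(index, query_vec, k=5):
    """Return (doc indices, similarity scores) for the top-k matches."""
    q = np.asarray(query_vec, dtype=np.float32)
    q = q / np.linalg.norm(q)
    scores = index @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]
```

Swap in real embeddings (e.g. from the embedding model of whichever tier you picked) and this is the entire retrieval layer.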

Do you need a managed vector database? For corpora under 10K docs, no · in-memory FAISS or SQLite is fine. For larger corpora or concurrent queries, yes. Pinecone, Weaviate, Qdrant, and Milvus are all solid.
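The SQLite option is less standard than FAISS, so a sketch of what it looks like: embeddings stored as float32 blobs in a table, scanned brute-force at query time (the table name and toy 2-d vectors are illustrative, not from this page):

```python
# "SQLite as a vector store" for tiny corpora: store embeddings as blobs,
# rank by cosine similarity in Python. Fine for a few thousand docs.
import math
import sqlite3
import struct

def to_blob(vec):
    return struct.pack(f"{len(vec)}f", *vec)

def from_blob(blob):
    return struct.unpack(f"{len(blob) // 4}f", blob)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, text TEXT, emb BLOB)")
for text, emb in [("alpha", [1.0, 0.0]), ("beta", [0.0, 1.0]), ("gamma", [0.7, 0.7])]:
    db.execute("INSERT INTO docs (text, emb) VALUES (?, ?)", (text, to_blob(emb)))

query = [0.9, 0.1]
rows = db.execute("SELECT text, emb FROM docs").fetchall()
ranked = sorted(rows, key=lambda r: cosine(query, from_blob(r[1])), reverse=True)
# ranked[0][0] == "alpha"
```

You trade query speed for zero infrastructure · the whole store is one file you can back up with `cp`.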