
Embedding

A dense numerical vector (256 to 8192 dimensions) that captures the semantic meaning of text, images, or audio.

Level 1

An embedding turns text into a fixed-size vector of numbers. Similar meanings produce nearby vectors · "dog" and "puppy" map to nearby points, "dog" and "car" to distant ones. Embeddings are the foundation of RAG, semantic search, clustering, and recommendations. Production embedding models include text-embedding-3-large (OpenAI), Voyage 3, Cohere Embed, and NV-Embed (NVIDIA).
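The "nearby vs. distant vectors" idea can be sketched with cosine similarity over toy vectors. The 4-dimensional vectors below are made up for illustration; real models emit 256 to 8192 dimensions:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes.
    # 1.0 means identical direction, 0.0 means orthogonal (unrelated).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4D "embeddings" -- illustrative values only.
dog   = [0.8, 0.6, 0.1, 0.0]
puppy = [0.7, 0.7, 0.2, 0.1]
car   = [0.0, 0.1, 0.9, 0.8]

print(cosine_similarity(dog, puppy))  # high: similar meaning
print(cosine_similarity(dog, car))    # low: distant meaning
```

Semantic search is just this comparison at scale: embed the query, then rank documents by cosine similarity to it.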

Level 2

Modern embedding models are trained with contrastive learning: pairs of semantically related text (e.g., query and relevant doc) are pulled close in embedding space while random pairs are pushed apart. Dimensions range from 256 (fast) to 8192 (highest quality). Cosine similarity is the dominant distance metric. Good embedding models hit 70%+ MTEB leaderboard scores. Costs: $0.05-0.20/M tokens on the major APIs. Embedding a large corpus is a one-time cost; serving queries is fast and cheap.
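The contrastive objective can be sketched as an InfoNCE loss in plain Python. This is a minimal sketch with illustrative 2D vectors; real training runs this over large GPU batches of learned embeddings:

```python
import math

def info_nce(query, positive, negatives, temperature=0.07):
    # InfoNCE: softmax cross-entropy over the query's similarity to one
    # positive (index 0) and many negatives. Minimizing it pulls the
    # positive close and pushes negatives apart.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    sims = [dot(query, positive)] + [dot(query, n) for n in negatives]
    logits = [s / temperature for s in sims]
    # Numerically stable log-softmax of the positive's logit.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)

query = [1.0, 0.0]
good_positive = [0.9, 0.1]   # semantically related pair: low loss
bad_positive  = [0.0, 1.0]   # unrelated pair: high loss
print(info_nce(query, good_positive, negatives=[[0.0, 1.0]]))
print(info_nce(query, bad_positive, negatives=[[0.9, 0.1]]))
```

Hard negative mining simply chooses the `negatives` list adversarially: near-miss documents rather than random ones, which makes the loss gradient far more informative.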

Level 3

Training objective: contrastive NT-Xent or InfoNCE loss. Hard negative mining (sampling confusingly similar negatives) improves retrieval quality. Dimensionality trade-offs: 1536D gives strong retrieval on MTEB but costs ~6 GB per million items at float32. Matryoshka embeddings (truncate to shorter dimensions at query time) give flexibility. Domain adaptation via in-domain fine-tuning improves retrieval 10-30% on specialized corpora. Evaluated with MTEB (Massive Text Embedding Benchmark) across 56 tasks. Top 2026: NV-Embed-v2 (73.7), Voyage-3 (~73), text-embedding-3-large (~70).
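A minimal sketch of Matryoshka-style truncation and the float32 memory arithmetic behind the dimensionality trade-off (toy vector; real Matryoshka models are trained so that prefixes of the vector remain useful on their own):

```python
import math

def truncate_and_renormalize(vec, dim):
    # Matryoshka-style truncation: keep the first `dim` coordinates,
    # then re-normalize so cosine similarity stays well defined.
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

short = truncate_and_renormalize([3.0, 4.0, 1.0, 2.0], dim=2)

# Memory arithmetic: float32 = 4 bytes per dimension.
full_dim, short_dim, items = 1536, 256, 1_000_000
print(full_dim * 4 * items / 1e9)   # ~6.1 GB per million items at 1536D
print(short_dim * 4 * items / 1e9)  # ~1.0 GB per million items at 256D
```

Truncating 1536D down to 256D cuts index memory roughly 6x, at some cost in retrieval quality.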

The takeaway for you
If you are a
Researcher
  • Contrastive learning with hard negatives
  • MTEB is the canonical benchmark · 56 task suite
  • Matryoshka embeddings enable dimensionality flexibility
If you are a
Builder
  • Use OpenAI text-embedding-3-large or Voyage 3 for most RAG
  • Rerank top-20 retrieved results with a cross-encoder for quality
  • Domain fine-tuning pays off for specialized corpora
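The retrieve-then-rerank pattern above can be sketched as two stages. Here `embed` and `cross_encoder_score` are hypothetical stand-ins you would replace with real model calls (an embedding API and a cross-encoder, e.g. from the sentence-transformers library):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_then_rerank(query, corpus, embed, cross_encoder_score, k=20):
    # Stage 1: cheap vector retrieval -- keep only the top-k candidates
    # by cosine similarity between query and document embeddings.
    q = embed(query)
    candidates = sorted(corpus, key=lambda d: cosine(q, embed(d)),
                        reverse=True)[:k]
    # Stage 2: expensive cross-encoder reranking over just k candidates.
    # The cross-encoder reads query and document together, so it is far
    # more accurate than vector similarity but too slow for the full corpus.
    return sorted(candidates,
                  key=lambda d: cross_encoder_score(query, d), reverse=True)
```

The design point: embedding retrieval bounds the candidate set cheaply, so the quadratic-cost cross-encoder only ever sees ~20 pairs per query.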
If you are a
Investor
  • Embedding infra is commoditizing · pricing dropped 10× since 2023
  • Vector DB market consolidating · pgvector + MongoDB Atlas winning
  • Rerankers and hybrid search are the new premium tier
If you are a
Curious · Normie
  • Turns words into numbers so computers can find similar meanings
  • How AI understands "cheap phone" and matches you to an iPhone SE
  • The foundation of AI search
Gecko's take

Embeddings are commodity infrastructure in 2026. Pick any top-5 provider and move on · the real moat is retrieval quality and reranking.

NV-Embed-v2 and Voyage-3 lead MTEB. OpenAI text-embedding-3-large is the default production choice for most teams.