
Hallucination

When a model generates confident-sounding output that is factually wrong, fabricated, or unsupported by reality.


Level 1

Hallucinations range from subtle factual errors (wrong date, wrong number) to entirely invented citations or events. Every LLM does it at some rate. Mitigations include: RAG (ground responses in retrieved documents), RLHF (train the model to prefer honest answers), reasoning models (extended thinking catches errors), and verification layers (a secondary model or tool checks the output).
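The verification-layer idea can be sketched in a few lines. This is a hypothetical illustration, not a production verifier: real systems use a secondary model, while this version only flags answer sentences whose content words never appear in the retrieved sources.

```python
# Hypothetical sketch of a verification layer: flag output sentences
# with no content-word overlap with the retrieved source documents.
# Real systems use a secondary model; this only shows the shape.

def content_words(text: str) -> set[str]:
    """Lowercase words longer than 3 chars, as a crude content filter."""
    return {w.strip(".,;:()").lower() for w in text.split() if len(w) > 3}

def flag_ungrounded(answer: str, sources: list[str]) -> list[str]:
    """Return answer sentences sharing no content words with any source."""
    source_vocab: set[str] = set()
    for doc in sources:
        source_vocab |= content_words(doc)
    flagged = []
    for sentence in answer.split(". "):
        words = content_words(sentence)
        if words and not (words & source_vocab):
            flagged.append(sentence)
    return flagged

sources = ["The Treaty of Westphalia was signed in 1648."]
answer = "The treaty was signed in 1648. Napoleon attended the ceremony."
print(flag_ungrounded(answer, sources))  # flags the Napoleon sentence
```

A flagged sentence would then be dropped, rewritten, or routed to a stronger checker before reaching the user.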

Level 2

Root cause: LLMs predict next-token probabilities, not truth. When the model lacks information, it predicts plausible continuations by pattern-matching, producing output that looks right but isn't. Rates vary widely: GPT-5 hallucinates on ≤1% of factual tasks with good prompting, while older models run 10-20% or more. Mitigations, ranked: RAG > reasoning + verification > RLHF > prompt engineering. Hallucination is hardest to avoid with obscure facts, numeric details, citations, legal and medical advice, and code that uses deprecated APIs.
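The "plausibility, not truth" failure mode shows up even in a toy model. This is an illustrative bigram counter, not a real LLM: asked to continue a prompt about a country it never saw, it must still emit whatever continuation was statistically most common in training.

```python
# Toy illustration (not a real LLM): a bigram model trained on a tiny
# corpus. It predicts the most frequent continuation of each word, so
# for an unseen country it confidently emits some capital anyway.
from collections import Counter, defaultdict

corpus = [
    "the capital of france is paris",
    "the capital of italy is rome",
    "the capital of spain is madrid",
]

# Count which word most often follows each word.
follows: dict[str, Counter] = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for a, b in zip(words, words[1:]):
        follows[a][b] += 1

def predict_next(word: str) -> str:
    """Most frequent continuation seen in training."""
    return follows[word].most_common(1)[0][0]

# Prompt: "the capital of germany is" -- the model has no fact for
# germany, but "is" was always followed by a capital, so it picks one.
print(predict_next("is"))
```

The output is a confident, fluent, wrong answer: exactly the hallucination pattern, at toy scale.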

Level 3

Measured on TruthfulQA (817 adversarial questions designed to elicit common misconceptions), HaluBench, and custom evals. RAG reduces hallucination 3-10× when the retrieved context is relevant, and almost not at all when retrieval fails. Self-consistency (sample k outputs, pick the majority) reduces some types. Reasoning models with extended thinking catch more errors via internal self-critique. Production systems stack: RAG retriever → verifier → reasoning model → citation extractor. Total hallucination rate for well-engineered pipelines: <0.5% on factual queries.
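The self-consistency step is just a majority vote. A minimal sketch, with the k model samples mocked as a plain list:

```python
# Sketch of self-consistency: sample k outputs from the model (mocked
# here as a list) and keep the majority answer. Individual samples
# disagree; the vote filters out minority hallucinations.
from collections import Counter

def majority_vote(samples: list[str]) -> str:
    """Return the most common answer among k model samples."""
    return Counter(samples).most_common(1)[0][0]

# Five mocked samples for the same prompt; two hallucinate the year.
samples = ["1648", "1638", "1648", "1648", "1658"]
print(majority_vote(samples))  # -> 1648
```

This only helps when errors are scattered across different wrong answers; if the model is consistently wrong the same way, the vote reinforces the error.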

The takeaway for you
If you are a Researcher
  • Next-token prediction optimizes plausibility, not truth
  • TruthfulQA and HaluBench measure adversarially
  • RAG reduces hallucination 3-10× when retrieval works
If you are a Builder
  • Ship with RAG for any factual workload
  • Verify with a secondary model or tool for high-stakes outputs
  • Citations force the model to ground claims, which reduces confabulation
If you are an Investor
  • Hallucination is the #1 blocker for enterprise AI in regulated industries
  • Grounding infra (RAG, verification) is a distinct moat, separate from model quality
  • Every frontier lab is racing to close the last 1%
If you are a Curious Normie
  • When AI confidently makes things up
  • Every AI does it; newer models do it less
  • Why you should fact-check AI-generated content
Don't mix them up
Hallucination vs RAG

RAG is the mitigation; hallucination is the problem. RAG grounds answers in real documents so the model has source material to cite.

Gecko's take

Hallucination is a solved problem for well-engineered pipelines. 90% of "AI made it up" stories in 2026 are poorly architected systems, not frontier model limits.

GPT-5 with reasoning + RAG hallucinates on <0.5% of factual queries. Claude 4.5 Opus is similar. All frontier models are within an order of magnitude of each other.