
Hallucination

When a model generates confident-sounding output that is factually wrong, fabricated, or unsupported by reality.


Level 1

Hallucinations range from subtle factual errors (wrong date, wrong number) to entirely invented citations or events. Every LLM does it at some rate. Mitigations include: RAG (ground responses in retrieved documents), RLHF (train the model to prefer honest answers), reasoning models (extended thinking catches errors), and verification layers (a secondary model or tool checks the output).
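The verification-layer idea can be sketched in a few lines. This is a hypothetical illustration, not a production verifier: real systems use a secondary model, while this version only flags answer sentences whose content words never appear in the retrieved sources.

```python
# Hypothetical sketch of a verification layer: flag output sentences
# with no content-word overlap with the retrieved source documents.
# Real systems use a secondary model; this only shows the shape.

def content_words(text: str) -> set[str]:
    """Lowercase words longer than 3 chars, as a crude content filter."""
    return {w.strip(".,;:()").lower() for w in text.split() if len(w) > 3}

def flag_ungrounded(answer: str, sources: list[str]) -> list[str]:
    """Return answer sentences sharing no content words with any source."""
    source_vocab: set[str] = set()
    for doc in sources:
        source_vocab |= content_words(doc)
    flagged = []
    for sentence in answer.split(". "):
        words = content_words(sentence)
        if words and not (words & source_vocab):
            flagged.append(sentence)
    return flagged

sources = ["The Treaty of Westphalia was signed in 1648."]
answer = "The treaty was signed in 1648. Napoleon attended the ceremony."
print(flag_ungrounded(answer, sources))  # flags the Napoleon sentence
```

A flagged sentence would then be dropped, rewritten, or routed to a stronger checker before reaching the user.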

Level 2

Root cause: LLMs predict next-token probabilities, not truth. When the model lacks information, it predicts plausible continuations by pattern-matching, producing output that looks right but isn't. Rates vary widely: GPT-5 hallucinates on ≤1% of factual tasks with good prompting, while older models run 10-20% or more. Mitigations, ranked: RAG > reasoning + verification > RLHF > prompt engineering. Hallucination is hardest to avoid with obscure facts, numeric details, citations, legal and medical advice, and code that uses deprecated APIs.
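The "plausibility, not truth" failure mode shows up even in a toy model. This is an illustrative bigram counter, not a real LLM: asked to continue a prompt about a country it never saw, it must still emit whatever continuation was statistically most common in training.

```python
# Toy illustration (not a real LLM): a bigram model trained on a tiny
# corpus. It predicts the most frequent continuation of each word, so
# for an unseen country it confidently emits some capital anyway.
from collections import Counter, defaultdict

corpus = [
    "the capital of france is paris",
    "the capital of italy is rome",
    "the capital of spain is madrid",
]

# Count which word most often follows each word.
follows: dict[str, Counter] = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for a, b in zip(words, words[1:]):
        follows[a][b] += 1

def predict_next(word: str) -> str:
    """Most frequent continuation seen in training."""
    return follows[word].most_common(1)[0][0]

# Prompt: "the capital of germany is" -- the model has no fact for
# germany, but "is" was always followed by a capital, so it picks one.
print(predict_next("is"))
```

The output is a confident, fluent, wrong answer: exactly the hallucination pattern, at toy scale.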

Level 3

Measured on TruthfulQA (817 adversarial questions designed to elicit common misconceptions), HaluBench, and custom evals. RAG reduces hallucination 3-10× when the retrieved context is relevant, and almost not at all when retrieval fails. Self-consistency (sample k outputs, pick the majority) reduces some types. Reasoning models with extended thinking catch more errors via internal self-critique. Production systems stack: RAG retriever → verifier → reasoning model → citation extractor. Total hallucination rate for well-engineered pipelines: <0.5% on factual queries.
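The self-consistency step is just a majority vote. A minimal sketch, with the k model samples mocked as a plain list:

```python
# Sketch of self-consistency: sample k outputs from the model (mocked
# here as a list) and keep the majority answer. Individual samples
# disagree; the vote filters out minority hallucinations.
from collections import Counter

def majority_vote(samples: list[str]) -> str:
    """Return the most common answer among k model samples."""
    return Counter(samples).most_common(1)[0][0]

# Five mocked samples for the same prompt; two hallucinate the year.
samples = ["1648", "1638", "1648", "1648", "1658"]
print(majority_vote(samples))  # -> 1648
```

This only helps when errors are scattered across different wrong answers; if the model is consistently wrong the same way, the vote reinforces the error.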

The takeaway for you
If you are a Researcher
  • Next-token prediction optimizes plausibility, not truth
  • TruthfulQA and HaluBench measure adversarially
  • RAG reduces hallucination 3-10× when retrieval works
If you are a Builder
  • Ship with RAG for any factual workload
  • Verify with a secondary model or tool for high-stakes outputs
  • Citations force the model to ground claims, which reduces confabulation
If you are an Investor
  • Hallucination is the #1 blocker for enterprise AI in regulated industries
  • Grounding infra (RAG, verification) is a distinct moat, separate from model quality
  • Every frontier lab is racing to close the last 1%
If you are a Curious Normie
  • When AI confidently makes things up
  • Every AI does it; newer models do it less
  • Why you should fact-check AI-generated content
Don't mix them up
Hallucination vs RAG

RAG is the mitigation; hallucination is the problem. RAG grounds answers in real documents so the model has source material to cite.

Gecko's take

Hallucination is a solved problem for well-engineered pipelines. 90% of "AI made it up" stories in 2026 are poorly architected systems, not frontier model limits.

GPT-5 with reasoning + RAG hallucinates on <0.5% of factual queries. Claude 4.5 Opus is similar. All frontier models are within an order of magnitude of each other.