
Tokenizer

The algorithm that converts raw text into tokens before the model processes them · BPE, SentencePiece, tiktoken.


Level 1

Every LLM ships with a tokenizer. GPT uses tiktoken (a BPE variant), Llama uses SentencePiece, and Claude uses a proprietary tokenizer. The tokenizer determines vocabulary size (32K to 200K+ tokens), per-language efficiency, and how many tokens a given text consumes. Models can't share tokenizers across architectures · swapping means retraining from scratch.

Level 2

BPE (Byte Pair Encoding) builds its vocabulary by greedily merging the most frequent adjacent symbol pairs. SentencePiece operates on raw text without language-specific pre-tokenization (supporting both BPE and unigram LM training), which makes it language-agnostic. WordPiece (BERT-era) merges by likelihood scoring instead of raw frequency. Modern frontier tokenizers use 200K+ vocabs to handle multilingual text, code, and mathematical notation efficiently. Larger vocab = fewer tokens per text but larger embedding tables. Tokenizer choice is locked at pretraining · swapping requires a full retrain.
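The greedy merge loop can be sketched in a few lines. This is a toy training run on a five-word corpus, not any production tokenizer's actual merge table:

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    # Represent each word as a tuple of symbols, weighted by frequency.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # greedy: most frequent pair wins
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word with the new merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges, words

merges, words = bpe_merges("low low low lower lowest", 3)
# merges: [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

After three merges, "low" has collapsed into a single token while the rarer suffixes "er" and "est" remain split · exactly the frequency-driven behavior that makes common English words cheap and rare strings expensive.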

Level 3

tiktoken is the canonical reference implementation for OpenAI-family tokenizers · Rust-backed, fast, exposed via the tiktoken Python package. SentencePiece handles Unicode safely without language-specific rules. Tokenizer efficiency varies: English averages about 1.3 tokens per word, Japanese 2-4×, code 1.5-2.5×. The modern 200K+ vocab trend (GPT-4o, Claude 3+) was a response to multilingual inefficiency.
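One way to see where the 2-4× overhead comes from: byte-level tokenizers fall back to raw UTF-8 bytes for anything outside the learned vocabulary, and CJK characters are 3 bytes each. A toy illustration · the vocabulary and greedy longest-match rule here are invented for the example, not taken from any real tokenizer:

```python
# Hypothetical tiny vocab, biased toward English fragments.
VOCAB = {"the", "cat", "sat", "ing", "er", " "}

def count_tokens(text):
    total, i = 0, 0
    while i < len(text):
        # Greedy longest match against the vocab.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                total += 1
                i = j
                break
        else:
            # Byte fallback: out-of-vocab characters cost one token per UTF-8 byte.
            total += len(text[i].encode("utf-8"))
            i += 1
    return total

english = count_tokens("the cat sat")  # 5 tokens for 11 characters
japanese = count_tokens("猫が座った")    # 15 tokens for 5 characters
```

The English sentence hits the vocab on every word; the Japanese one misses entirely and pays 3 byte-tokens per character. Real tokenizers learn some CJK merges, which is why the observed penalty is 2-4× rather than this worst case.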

The takeaway for you
If you are a
Researcher
  • BPE, SentencePiece, WordPiece are the main algorithms
  • Vocab size locked at pretrain · retrain required to swap
  • Modern trend: 200K+ vocab for multilingual efficiency
If you are a
Builder
  • Use tiktoken to count tokens for OpenAI models · exact count matters for cost
  • Anthropic SDK provides count_tokens for Claude
  • Non-English workloads tokenize less efficiently · budget accordingly
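A budgeting sketch for that last point. The price and language multipliers below are hypothetical placeholders · check your provider's current pricing · with the multipliers drawn from the rough per-language ratios cited earlier:

```python
# Hypothetical input price per million tokens (USD) — not a real quote.
PRICE_PER_MTOK = 3.00
# Rough token-inflation multipliers vs. English, for budgeting only.
LANG_MULTIPLIER = {"en": 1.0, "ja": 3.0, "code": 2.0}

def estimate_cost(english_equiv_tokens, lang="en"):
    """Estimate input cost for a workload sized in English-equivalent tokens."""
    tokens = english_equiv_tokens * LANG_MULTIPLIER[lang]
    return tokens / 1_000_000 * PRICE_PER_MTOK

# The same 100K-token (English-equivalent) workload, priced per language:
en_cost = estimate_cost(100_000, "en")  # 0.30 USD
ja_cost = estimate_cost(100_000, "ja")  # 0.90 USD
```

The point of the sketch: the multiplier compounds on every request, so a Japanese-heavy product pays the tokenizer tax on its entire traffic, not just edge cases.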
If you are a
Investor
  • Tokenizer efficiency is a hidden pricing lever · multilingual markets suffer
  • Chinese/Japanese AI adoption lags partly due to 2-4× token overhead on most tokenizers
  • Tokenizer locked at pretrain means switching tokenizers = full retrain
If you are a
Curious · Normie
  • The thing that cuts your text into chunks AI can read
  • Why you can't compare token counts across different AI providers
  • Different AIs cut up words differently
Gecko's take

Tokenizers are the unsung bottleneck of multilingual AI. Non-English languages pay 2-4× more tokens than English for content of the same quality.

And no, you can't swap it after the fact · the tokenizer is baked in at pretraining, and changing it means retraining the entire model.