Tokenizer
The algorithm that converts raw text into tokens before the model processes them · BPE, SentencePiece, tiktoken.
Basic
Every LLM ships with a tokenizer. GPT models use tiktoken (a byte-level BPE variant), Llama uses SentencePiece, and Claude uses its own tokenizer. The tokenizer determines vocabulary size (32K to 200K+ tokens), per-language efficiency, and how many tokens a given text consumes. Tokenizers can't be shared across model families · swapping one means retraining from scratch.
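Because each model ships its own tokenizer, the same text consumes a different number of tokens under each one. A minimal sketch with two toy schemes (neither is a real production tokenizer; the subword vocabulary is an illustrative assumption):

```python
# Toy illustration: the same text yields different token counts under
# different tokenization schemes, which is why counts are not comparable
# across providers.
text = "internationalization"

# Scheme A: one token per character (like a tiny character-level vocab).
char_tokens = list(text)

# Scheme B: greedy longest-match against a made-up subword vocabulary.
vocab = {"inter", "national", "ization", "nation", "al"}

def greedy_tokenize(s, vocab):
    tokens, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):   # try the longest match first
            if s[i:j] in vocab:
                tokens.append(s[i:j])
                i = j
                break
        else:
            tokens.append(s[i])          # fall back to a single character
            i += 1
    return tokens

print(len(char_tokens))                  # 20 character tokens
print(greedy_tokenize(text, vocab))      # ['inter', 'national', 'ization']
```

Same word, 20 tokens under one scheme and 3 under the other · the count depends entirely on the vocabulary.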
Deep
BPE (Byte Pair Encoding) builds its vocabulary by greedily merging the most frequent symbol pairs. SentencePiece operates on raw text without language-specific pre-tokenization (supporting both BPE and unigram models), making it language-agnostic. WordPiece (BERT-era) scores merges by likelihood rather than raw frequency. Modern frontier tokenizers use 200K+ vocabularies to handle multilingual text, code, and mathematical notation efficiently. A larger vocab means fewer tokens per text but a larger embedding table. Tokenizer choice is locked in at pretraining · swapping requires a full retrain.
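The greedy-merge loop at the heart of BPE can be sketched in a few lines. This is a toy character-level version on a made-up four-word corpus, not any production tokenizer (real tokenizers work on bytes over enormous corpora and learn tens of thousands of merges):

```python
# Minimal BPE training sketch: repeatedly find the most frequent adjacent
# symbol pair and merge it into a new vocabulary symbol.
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: word -> frequency, each word as a tuple of characters.
corpus = {tuple("lower"): 5, tuple("lowest"): 2,
          tuple("newer"): 6, tuple("wider"): 3}

merges = []
for _ in range(4):                       # learn 4 merges for illustration
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)     # most frequent pair wins
    merges.append(best)
    corpus = merge_pair(corpus, best)

print(merges)  # first merge is ('e', 'r'), the most frequent pair (14)
```

Frequent fragments like `er` become single vocabulary entries first, which is exactly why common English words end up as one or two tokens while rare strings splinter into many.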
Expert
tiktoken is the canonical reference implementation for OpenAI-family tokenizers · Rust-backed, fast, and exposed through the `tiktoken` Python package. SentencePiece handles Unicode safely without language-specific rules. Tokenizer efficiency varies widely: English averages ~1.3 tokens per word, Japanese costs 2-4× that, and code 1.5-2.5×. The 200K+ vocab trend in modern models (GPT-4o, Claude 3+) was a response to this multilingual inefficiency.
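Those ratios translate directly into API cost. A back-of-envelope budgeting sketch using the figures above · the multiplier and price are illustrative assumptions, not provider-published numbers:

```python
# Rough token-budget estimate: English ~1.3 tokens/word, Japanese 2-4x that.
EN_TOKENS_PER_WORD = 1.3
JA_MULTIPLIER = 3.0           # midpoint of the 2-4x range (assumption)
PRICE_PER_1K_TOKENS = 0.01    # hypothetical price in USD, for illustration

def estimate(words, multiplier=1.0):
    """Return (estimated tokens, estimated cost) for a word count."""
    tokens = words * EN_TOKENS_PER_WORD * multiplier
    return tokens, tokens / 1000 * PRICE_PER_1K_TOKENS

en_tokens, en_cost = estimate(10_000)                 # 10K English words
ja_tokens, ja_cost = estimate(10_000, JA_MULTIPLIER)  # equivalent Japanese

print(f"English:  {en_tokens:.0f} tokens, ${en_cost:.2f}")
print(f"Japanese: {ja_tokens:.0f} tokens, ${ja_cost:.2f}")
```

Same content, roughly 3× the bill · for exact counts, use the provider's own counter (tiktoken for OpenAI models, the Anthropic SDK's count_tokens for Claude) rather than a ratio estimate.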
Depending on why you're here
- BPE, SentencePiece, WordPiece are the main algorithms
- Vocab size locked at pretrain · retrain required to swap
- Modern trend: 200K+ vocab for multilingual efficiency
- Use tiktoken to count tokens for OpenAI models · exact counts matter for cost
- Anthropic's SDK provides count_tokens for Claude
- Non-English workloads tokenize less efficiently · budget accordingly
- Tokenizer efficiency is a hidden pricing lever · multilingual markets pay more
- Chinese/Japanese AI adoption lags partly due to 2-4× token overhead on most tokenizers
- Tokenizer locked at pretrain means switching tokenizers = full retrain
- The thing that cuts your text into chunks the AI can read
- Why you can't compare token counts across different AI providers
- Different AIs cut up words differently
Tokenizers are the unsung bottleneck of multilingual AI. Every non-English language pays 2-4× the English price for the same content.