
Tokenizer

The algorithm that converts raw text into tokens before the model processes them · BPE, SentencePiece, tiktoken.


Level 1

Every LLM ships with a tokenizer. GPT uses tiktoken (a BPE variant), Llama uses SentencePiece, and Claude uses a proprietary tokenizer. The tokenizer determines vocabulary size (32K to 200K+ tokens), per-language efficiency, and how many tokens a given text consumes. Models can't share tokenizers across architectures · swapping means retraining from scratch.

Level 2

BPE (Byte Pair Encoding) builds its vocabulary by greedily merging the most frequent adjacent symbol pairs. SentencePiece operates on raw text without language-specific pre-tokenization (supporting both BPE and unigram LM training), which makes it language-agnostic. WordPiece (BERT-era) merges by likelihood scoring instead of raw frequency. Modern frontier tokenizers use 200K+ vocabs to handle multilingual text, code, and mathematical notation efficiently. Larger vocab = fewer tokens per text but larger embedding tables. Tokenizer choice is locked at pretraining · swapping requires a full retrain.
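The greedy merge loop can be sketched in a few lines. This is a toy training run on a five-word corpus, not any production tokenizer's actual merge table:

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    # Represent each word as a tuple of symbols, weighted by frequency.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # greedy: most frequent pair wins
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every word with the new merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges, words

merges, words = bpe_merges("low low low lower lowest", 3)
# merges: [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

After three merges, "low" has collapsed into a single token while the rarer suffixes "er" and "est" remain split · exactly the frequency-driven behavior that makes common English words cheap and rare strings expensive.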

Level 3

tiktoken is the canonical reference implementation for OpenAI-family tokenizers · Rust-backed, fast, exposed via the tiktoken Python package. SentencePiece handles Unicode safely without language-specific rules. Tokenizer efficiency varies: English averages about 1.3 tokens per word, Japanese 2-4×, code 1.5-2.5×. The modern 200K+ vocab trend (GPT-4o, Claude 3+) was a response to multilingual inefficiency.
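One way to see where the 2-4× overhead comes from: byte-level tokenizers fall back to raw UTF-8 bytes for anything outside the learned vocabulary, and CJK characters are 3 bytes each. A toy illustration · the vocabulary and greedy longest-match rule here are invented for the example, not taken from any real tokenizer:

```python
# Hypothetical tiny vocab, biased toward English fragments.
VOCAB = {"the", "cat", "sat", "ing", "er", " "}

def count_tokens(text):
    total, i = 0, 0
    while i < len(text):
        # Greedy longest match against the vocab.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                total += 1
                i = j
                break
        else:
            # Byte fallback: out-of-vocab characters cost one token per UTF-8 byte.
            total += len(text[i].encode("utf-8"))
            i += 1
    return total

english = count_tokens("the cat sat")  # 5 tokens for 11 characters
japanese = count_tokens("猫が座った")    # 15 tokens for 5 characters
```

The English sentence hits the vocab on every word; the Japanese one misses entirely and pays 3 byte-tokens per character. Real tokenizers learn some CJK merges, which is why the observed penalty is 2-4× rather than this worst case.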

The takeaway for you
If you are a
Researcher
  • BPE, SentencePiece, WordPiece are the main algorithms
  • Vocab size locked at pretrain · retrain required to swap
  • Modern trend: 200K+ vocab for multilingual efficiency
If you are a
Builder
  • Use tiktoken to count tokens for OpenAI models · exact count matters for cost
  • Anthropic SDK provides count_tokens for Claude
  • Non-English workloads tokenize less efficiently · budget accordingly
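A budgeting sketch for that last point. The price and language multipliers below are hypothetical placeholders · check your provider's current pricing · with the multipliers drawn from the rough per-language ratios cited earlier:

```python
# Hypothetical input price per million tokens (USD) — not a real quote.
PRICE_PER_MTOK = 3.00
# Rough token-inflation multipliers vs. English, for budgeting only.
LANG_MULTIPLIER = {"en": 1.0, "ja": 3.0, "code": 2.0}

def estimate_cost(english_equiv_tokens, lang="en"):
    """Estimate input cost for a workload sized in English-equivalent tokens."""
    tokens = english_equiv_tokens * LANG_MULTIPLIER[lang]
    return tokens / 1_000_000 * PRICE_PER_MTOK

# The same 100K-token (English-equivalent) workload, priced per language:
en_cost = estimate_cost(100_000, "en")  # 0.30 USD
ja_cost = estimate_cost(100_000, "ja")  # 0.90 USD
```

The point of the sketch: the multiplier compounds on every request, so a Japanese-heavy product pays the tokenizer tax on its entire traffic, not just edge cases.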
If you are a
Investor
  • Tokenizer efficiency is a hidden pricing lever · multilingual markets suffer
  • Chinese/Japanese AI adoption lags partly due to 2-4× token overhead on most tokenizers
  • Tokenizer locked at pretrain means switching tokenizers = full retrain
If you are a
Curious · Normie
  • The thing that cuts your text into chunks AI can read
  • Why you can't compare token counts across different AI providers
  • Different AIs cut up words differently
Gecko's take

Tokenizers are the unsung bottleneck of multilingual AI. Non-English languages pay 2-4× more tokens than English for content of the same quality.

And no, you can't swap it after the fact · the tokenizer is baked in at pretraining, and changing it means retraining the entire model.