What Are Tokens?
If you’ve ever hit a context limit or been surprised by an API bill, you already know tokens matter. But what actually are they?
Tokens aren’t words and they aren’t characters. They’re chunks of text that a language model breaks your input into before processing it. Think of them as the atomic units that LLMs actually “read.” The word “tokenization” itself might get split into “token” + “ization” – two tokens. A short word like “the” is typically one token. A long technical term might be three or four.
The exact split depends on the model’s tokenizer – the algorithm it uses to chop up text. Different models use different tokenizers, which is why the same paragraph produces different token counts across GPT, Claude, and Gemini.
How Tokenization Actually Works
Most modern LLMs use a technique called Byte Pair Encoding (BPE) or a close variant. Here’s the gist:
- Start with individual characters (or bytes)
- Find the most frequently occurring pair of adjacent tokens in the training data
- Merge that pair into a single new token
- Repeat thousands of times until you’ve got a vocabulary of 50K-100K tokens
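The merge loop above can be sketched in a toy form. This is an illustrative implementation on a four-word corpus with three merges, not any real tokenizer's vocabulary or byte-level handling:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.

    Real tokenizers operate on bytes and run tens of thousands of merges;
    this just shows the mechanic.
    """
    # Represent each word as a tuple of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing the winning pair into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_merges(["low", "lower", "lowest", "low"], num_merges=3)
# merges: [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

After three merges, "low" has become a single token while rarer suffixes like "er" and "est" are still split apart, which is exactly why frequent words end up as one token and rare words as several.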
OpenAI’s models use a BPE variant through their tiktoken library. Google’s Gemini models use SentencePiece, which operates on raw text (including spaces) rather than pre-tokenized words. Anthropic’s Claude uses its own BPE-based tokenizer with a vocabulary optimized for code and multilingual text.
The practical difference? Claude tends to produce fewer tokens for the same text (roughly 3.5 characters per token) compared to GPT models (roughly 4 characters per token). That gap widens with code-heavy or multilingual content.
Why Token Counts Matter
Tokens affect three things you care about:
Cost. API pricing is per-token for both input and output. If you’re building an app that sends 10K tokens per request at $10/million input tokens, that’s $0.10 per request – and it adds up fast when you’re handling thousands of users.
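The arithmetic from that example, as a small helper. The $10/million input price comes from the text; the $30/million output price is a placeholder assumption, not any provider's actual rate:

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_m, output_price_per_m):
    """Per-request cost, given per-million-token prices."""
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# 10K input tokens at $10/million input tokens, no output counted:
cost = request_cost(10_000, 0, input_price_per_m=10.0, output_price_per_m=30.0)
# 10_000 / 1_000_000 * $10 = $0.10 per request
```

At $0.10 per request, 10,000 requests a day is $1,000/day, which is why trimming input tokens is usually the first optimization worth making.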
Context window. Every model has a maximum number of tokens it can process in a single request. GPT-5.4 handles 256K tokens, Gemini 3 stretches to 2M, and Claude Opus 4.6 sits at 200K. If your prompt plus the expected response exceeds the context window, you’ll need to trim or chunk your input.
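A quick way to apply this check before sending a request. The context-window figures are the ones quoted above; the prompt and output sizes are made-up examples:

```python
# Context windows from the text, in tokens.
CONTEXT_WINDOWS = {
    "gpt-5.4": 256_000,
    "gemini-3": 2_000_000,
    "claude-opus-4.6": 200_000,
}

def fits_context(prompt_tokens, max_output_tokens, context_window):
    """True if the prompt plus the reserved output budget fits the window."""
    return prompt_tokens + max_output_tokens <= context_window

fits_context(190_000, 8_000, CONTEXT_WINDOWS["claude-opus-4.6"])  # 198K fits
fits_context(195_000, 8_000, CONTEXT_WINDOWS["claude-opus-4.6"])  # 203K does not
```

Note that the output budget counts against the same window, so always reserve room for the response rather than filling the window with input.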
Response quality. Longer prompts don’t always mean better results. Models can lose focus in very long contexts (the “lost in the middle” problem). Keeping your prompts concise often improves output quality while cutting costs.
Character-to-Token Ratios by Model Family
| Provider | Models | Avg. Chars/Token |
|---|---|---|
| Anthropic | Claude Opus/Sonnet/Haiku | ~3.5 |
| OpenAI | GPT-5.4, GPT-4o | ~4.0 |
| Google | Gemini 3, Gemini 2.5 | ~4.0 |
| Meta | Llama 4 Maverick | ~3.8 |
| Mistral | Mistral Large 3 | ~3.8 |
| DeepSeek | DeepSeek V3 | ~3.5 |
| xAI | Grok 3 | ~4.0 |
These ratios are averages for English text. Code, non-Latin scripts, and text with lots of special characters will tokenize differently – usually producing more tokens per character.
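The ratios in the table translate directly into a rough estimator, essentially the approach this tool uses (character count divided by the provider's average). The example sentence and the rounding choice are illustrative:

```python
# Average characters-per-token ratios from the table above (English text).
CHARS_PER_TOKEN = {
    "anthropic": 3.5,
    "openai": 4.0,
    "google": 4.0,
    "meta": 3.8,
    "mistral": 3.8,
    "deepseek": 3.5,
    "xai": 4.0,
}

def estimate_tokens(text, provider):
    """Rough token estimate: character count / provider's average ratio."""
    return round(len(text) / CHARS_PER_TOKEN[provider])

text = "Tokens are the atomic units that LLMs actually read."
estimate_tokens(text, "openai")     # 52 characters / 4.0 -> 13 tokens
estimate_tokens(text, "anthropic")  # 52 characters / 3.5 -> 15 tokens
```

The same string produces different estimates per provider, which mirrors why identical prompts yield different token counts (and bills) across APIs.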
Tips for Reducing Token Usage
Want to keep your API costs down? Here are some practical strategies:
- Be specific in your prompts. Vague instructions force the model to guess, which means longer outputs and wasted tokens on both sides.
- Use system prompts wisely. A well-crafted system prompt can replace pages of per-request instructions.
- Trim unnecessary context. Don’t dump an entire document into the prompt if the model only needs two paragraphs.
- Pick the right model. You don’t always need the flagship. For simple tasks, a smaller model like GPT-4o Mini or Claude Haiku gives you 90% of the quality at 5% of the cost.
- Cache repeated content. If your app sends the same system prompt with every request, use prompt caching (available on both OpenAI and Anthropic APIs) to avoid paying for those tokens repeatedly.
- Chunk large documents. Instead of stuffing everything into one request, break documents into chunks and process them separately. Our Chunking Preview tool can help you plan this.
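A minimal character-based chunker illustrates the last tip (this is a sketch, not the Chunking Preview tool itself; the 4.0 chars/token ratio is the English-text average from the table, and real chunkers usually split on sentence or paragraph boundaries rather than raw character offsets):

```python
def chunk_by_tokens(text, max_tokens, chars_per_token=4.0):
    """Split text into pieces that each fit an approximate token budget,
    using the character-count approximation."""
    max_chars = int(max_tokens * chars_per_token)
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

# A 10,000-character document with a 1,000-token budget per chunk:
chunks = chunk_by_tokens("x" * 10_000, max_tokens=1_000)
# 1,000 tokens ~ 4,000 chars per chunk -> 3 chunks (4000, 4000, 2000 chars)
```

Each chunk can then be sent as its own request, keeping every call comfortably inside the context window and making costs predictable per chunk.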
How This Tool Estimates Tokens
This tool uses character-to-token ratios specific to each model family. It divides your text’s character count by the model’s average characters-per-token ratio. While this won’t match the exact output of each provider’s tokenizer (which would require running their specific BPE algorithm), it’s accurate enough for cost estimation and planning purposes.
For exact counts, you’d need to use each provider’s tokenization library directly – tiktoken for OpenAI, the Anthropic SDK’s built-in counter, or Google’s token counting API. But for quick estimates and cost comparisons across models, character-based approximation gets you within 5-10% of the real number.