Gemini Token Counting
Google’s Gemini models stand out for one headline feature: context window size. Gemini 3 supports 2 million tokens in a single request. At a typical ratio of about 0.75 English words per token, that’s roughly 1.5 million words, or about 20 novels of 75,000 words each. Few other commercial models come close.
Gemini uses a SentencePiece-based tokenizer, which processes raw text (treating whitespace as an ordinary symbol) rather than pre-splitting on word boundaries. For English, it averages about 4 characters per token, putting it in the same ballpark as the GPT models. The tokenizer handles over 100 languages and is particularly efficient with CJK characters compared to the byte-level BPE tokenizers used by some alternatives.
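The 4-characters-per-token figure can be turned into a quick estimator. The snippet below is a minimal sketch under that assumption; `estimate_tokens` and its `chars_per_token` parameter are illustrative names, not part of any Google SDK, and the ratio will differ for code or non-English text.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from character count (English-text heuristic)."""
    if chars_per_token <= 0:
        raise ValueError("chars_per_token must be positive")
    # Round up so a short, non-empty string never estimates to zero tokens.
    return math.ceil(len(text) / chars_per_token)


sample = "Gemini uses a SentencePiece tokenizer."
print(estimate_tokens(sample))
```

This is only useful for planning; it does not run the real tokenizer.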
Gemini Model Options
| Model | Context (tokens) | Max Output (tokens) | Input ($/1M tokens) | Output ($/1M tokens) |
|---|---|---|---|---|
| Gemini 3 | 2M | 64K | $7.00 | $21.00 |
| Gemini 2.5 Pro | 1M | 32K | $3.50 | $10.50 |
| Gemini 2.5 Flash | 1M | 16K | $0.15 | $0.60 |
The lineup covers a wide range of use cases. Gemini 3 is the powerhouse for tasks that need maximum context and reasoning. Gemini 2.5 Pro balances capability with cost. And Flash is ridiculously cheap for high-volume tasks: at $0.15 per million input tokens, it competes with GPT-4o Mini and Claude Haiku on price.
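To compare the models on a concrete workload, the per-million prices can be plugged into a small cost function. This is a sketch that copies the prices from the table above as-is (they may be out of date); `PRICES` and `estimate_cost` are hypothetical names used only for illustration.

```python
# USD per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "Gemini 3": (7.00, 21.00),
    "Gemini 2.5 Pro": (3.50, 10.50),
    "Gemini 2.5 Flash": (0.15, 0.60),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated request cost in USD for the given token counts."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Example workload: a 100K-token prompt with a 2K-token answer.
for name in PRICES:
    print(f"{name}: ${estimate_cost(name, 100_000, 2_000):.4f}")
```

Because output tokens cost several times more than input tokens in every row, long generated answers can dominate the bill even when the prompt is large.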
When That Giant Context Window Matters
Most tasks don’t need 2 million tokens of context. But when a task does, Gemini is one of the few commercial APIs that can take it in a single request:
- Full codebase analysis. Drop an entire repository into the context and ask questions about architecture, dependencies, or potential bugs.
- Legal document review. Process complete contracts, regulations, or patent filings without chunking.
- Book-length content. Summarize, analyze, or translate entire books in a single pass.
- Long conversation history. Maintain very long chat sessions without losing earlier context.
Even with the massive context, keep in mind that more tokens mean higher latency and higher cost. If you can accomplish the task with less context, you probably should.
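Before sending a whole repository or book, it helps to check roughly whether it fits at all. The sketch below assumes the 4-characters-per-token heuristic, reads the context sizes from the table above as round decimal numbers, and pads the estimate by 10% as a safety margin; `CONTEXT_LIMITS` and `fits_in_context` are illustrative names, not an official API.

```python
import math

# Context windows in tokens, read from the table above as round numbers
# (an assumption; the exact limits may differ slightly).
CONTEXT_LIMITS = {
    "Gemini 3": 2_000_000,
    "Gemini 2.5 Pro": 1_000_000,
    "Gemini 2.5 Flash": 1_000_000,
}


def fits_in_context(texts, model, margin_percent=10, chars_per_token=4.0):
    """Return (fits, padded_estimate) for a batch of text documents."""
    # Estimate each document separately, rounding up per document.
    estimated = sum(math.ceil(len(t) / chars_per_token) for t in texts)
    # Pad the estimate, since the heuristic can undercount.
    padded = math.ceil(estimated * (100 + margin_percent) / 100)
    return padded <= CONTEXT_LIMITS[model], padded


docs = ["x" * 4_000_000]  # about 1M estimated tokens
print(fits_in_context(docs, "Gemini 2.5 Pro"))
print(fits_in_context(docs, "Gemini 3"))
```

If the padded estimate is close to the limit, an exact count is the safer basis for the decision.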
Google’s Count Tokens API
For exact token counts in production, Google provides a countTokens API endpoint that returns the precise count for any input. It works with text, images, video, and audio. This tool instead estimates text tokens from the character-to-token ratio, which gets you within 5-10% of the exact count, close enough for planning purposes. For multimodal content, you’ll want to use the API directly.
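For reference, a text-only call to the countTokens endpoint might look like the following. This is a sketch using only the Python standard library against the public Gemini REST API; the model ID `gemini-2.5-flash` and the helper names `build_count_tokens_request` and `count_tokens` are assumptions for illustration, and you need your own API key.

```python
import json
import urllib.request

API_ROOT = "https://generativelanguage.googleapis.com/v1beta"


def build_count_tokens_request(model: str, text: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a countTokens request for plain text."""
    url = f"{API_ROOT}/models/{model}:countTokens"
    body = json.dumps({"contents": [{"parts": [{"text": text}]}]}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
        method="POST",
    )


def count_tokens(model: str, text: str, api_key: str) -> int:
    """Send the request and return the exact token count from the response."""
    request = build_count_tokens_request(model, text, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["totalTokens"]
```

With a key in hand, `count_tokens("gemini-2.5-flash", "Hello, Gemini!", api_key)` returns the exact count, which you can compare against the character-based estimate to see how far off the heuristic is for your own content.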