Module 1
Tokenisation
Before an AI can read your words, it has to break them up into small pieces called tokens. Let's see how that works.
What is a token?
A token is a chunk of text: usually a word, part of a word, or a punctuation mark. AI models don't read letter by letter or word by word. They read in tokens.
For example, the word "unhappiness" might become three tokens: "un" + "happi" + "ness". Shorter, common words usually stay as one token.
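A real tokenizer learns its pieces from data (for example via byte-pair encoding), but the splitting idea can be sketched as a greedy longest-match over a vocabulary. The tiny vocabulary below is invented for this illustration; real models use ~100,000 learned pieces:

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match subword split (toy illustration, not a real BPE)."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining piece first, shrinking until one is in the vocab.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(word[i])
            i += 1
    return tokens

# Toy vocabulary chosen for this example.
vocab = {"un", "happi", "ness", "happy", "the", "cat"}
print(subword_tokenize("unhappiness", vocab))  # ['un', 'happi', 'ness']
```

Notice that the common word "happy" is a single piece in this toy vocabulary, while the rarer "unhappiness" gets split — the same pattern the text above describes.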
💡 Why does it matter? AI models have a limit on how many tokens they can process at once; this is called the context window. Understanding tokens helps you understand why AI sometimes "forgets" earlier parts of a long conversation.
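That "forgetting" can be sketched in a few lines: when a conversation exceeds the token budget, the oldest turns are the ones that fall outside the window. The budget and the chars-per-token rule below are made-up numbers for illustration:

```python
# Sketch of why long chats "forget": keep only the newest turns that fit in a
# token budget. The budget and the chars-per-token estimate are illustrative.
def trim_to_window(turns, max_tokens, count_tokens):
    kept, total = [], 0
    for turn in reversed(turns):       # walk from newest to oldest
        cost = count_tokens(turn)
        if total + cost > max_tokens:  # this turn would overflow the window...
            break                      # ...so it and everything older is dropped
        kept.append(turn)
        total += cost
    return list(reversed(kept))        # restore chronological order

estimate = lambda text: max(1, len(text) // 4)  # crude ~4 chars/token estimate
chat = ["a very long opening message " * 40, "second message", "latest question"]
print(trim_to_window(chat, max_tokens=20, count_tokens=estimate))
# ['second message', 'latest question'] -- the long opening message fell out
```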
Try it yourself
Token breakdown:
"The cat sat on the mat." → 7 tokens, 23 characters, ~3.3 chars/token
["The", " cat", " sat", " on", " the", " mat", "."]
(Each word after the first keeps its leading space as part of the token; the full stop is a token of its own.)
Each token maps to a unique ID number that the AI uses internally.
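The token-to-number mapping is just a lookup table. The ID numbers below are hypothetical, picked for this example; each real model's vocabulary assigns its own numbering:

```python
# Each token string gets a unique integer ID. These IDs are hypothetical,
# for illustration only; a real tokenizer has its own numbering.
vocab = {"The": 464, " cat": 2415, " sat": 7731, " on": 319,
         " the": 262, " mat": 2603, ".": 13}
id_to_token = {i: t for t, i in vocab.items()}

tokens = ["The", " cat", " sat", " on", " the", " mat", "."]
ids = [vocab[t] for t in tokens]                 # encode: text pieces -> numbers
decoded = "".join(id_to_token[i] for i in ids)   # decode: numbers -> text

print(ids)
print(decoded)  # The cat sat on the mat.
```

Because the mapping works in both directions, the model can consume numbers as input and emit numbers as output, which are then decoded back into text.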
Token facts
~100,000
Tokens in GPT-4's vocabulary
Modern AI models have a vocabulary of roughly 100,000 possible tokens. That is fewer than the ~170,000 words in English, but because tokens are subwords, they can combine to spell any word.
128,000
Token context window (GPT-4o)
That's roughly 96,000 words: about the length of a full novel that the AI can read at once.
~0.75
Words per token (average)
On average, every token is about three-quarters of a word, so a 1,000-word essay is roughly 1,333 tokens (1,000 ÷ 0.75).
Varies
Tokens per language
English text is efficient: roughly 4 characters per token. Some languages use 2–3× more tokens for the same meaning, making AI more expensive to run in those languages.
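The rules of thumb above can be turned into a quick back-of-envelope estimator. These are heuristics only; exact counts always depend on the specific tokenizer:

```python
# Back-of-envelope token estimates from the rules of thumb above.
# Heuristics only: exact counts depend on the tokenizer and the language.
def tokens_from_chars(text):
    """~4 characters per token for typical English text."""
    return max(1, round(len(text) / 4))

def tokens_from_words(word_count):
    """~0.75 words per token, i.e. tokens ~ words / 0.75."""
    return round(word_count / 0.75)

print(tokens_from_words(1000))  # a 1,000-word essay -> roughly 1333 tokens
```

Note that the estimate will rarely match an exact count — the demo sentence above really uses 7 tokens, while the character heuristic predicts 6 — but it is close enough for budgeting against a context window.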
What you've learned
- AI models split text into tokens: not letters, not always whole words
- Common words are usually one token; rare or long words get split up
- Every token becomes a number that the AI can process mathematically
- The total number of tokens determines how much the AI can read at once
Module 1 of 6 · Next: Module 2, Embeddings →