🔀

Module 1

Tokenisation

Before an AI can read your words, it has to break them up into small pieces called tokens. Let's see how that works.

What is a token?

A token is a chunk of text: usually a word, part of a word, or a punctuation mark. AI models don't read letter by letter or word by word. They read in tokens.

For example, the word “unhappiness” might become three tokens: “un” + “happi” + “ness”. Shorter, common words usually stay as one token.
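Real models learn their token vocabulary from data using algorithms such as byte-pair encoding. A minimal sketch of the idea, using a tiny hand-made vocabulary (the vocabulary and the greedy longest-match rule here are illustrative assumptions, not how any particular model works):

```python
# Toy subword tokenizer: greedily match the longest vocabulary piece.
# VOCAB is hand-made for illustration; real models learn ~100,000
# pieces from training data.
VOCAB = {"un", "happi", "ness", "happy", "the", "cat"}

def tokenize(word, vocab=VOCAB):
    """Split a word into the longest matching pieces, left to right."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No piece matched: fall back to a single character.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("unhappiness"))  # ['un', 'happi', 'ness']
```

Note how the rare word splits into three familiar pieces, while a common word like “cat” stays whole.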

💡 Why does it matter? AI models have a limit on how many tokens they can process at once, called the context window. Understanding tokens helps you understand why AI sometimes “forgets” earlier parts of a long conversation.
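That “forgetting” can be sketched in a few lines. The sketch below assumes a rough 4-characters-per-token estimate and a deliberately tiny window of 30 tokens (real windows are tens of thousands); the trimming strategy shown, keeping only the newest messages that fit, is one common approach, not the only one:

```python
# Sketch: why long chats get "forgotten". Assumes ~4 characters per
# token (a rough English average) and a toy context window.
CONTEXT_WINDOW = 30  # tokens; real models allow tens of thousands

def estimate_tokens(text):
    return max(1, len(text) // 4)

def fit_to_window(messages, limit=CONTEXT_WINDOW):
    """Keep the most recent messages that fit the token budget."""
    kept, used = [], 0
    for msg in reversed(messages):      # walk newest-first
        cost = estimate_tokens(msg)
        if used + cost > limit:
            break                       # older messages get dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))

chat = ["Hi!", "Tell me about tokens." * 5, "What is a context window?"]
print(fit_to_window(chat))  # ['What is a context window?']
```

The long middle message overflows the budget, so everything before it is silently dropped, which is exactly what a user experiences as the AI forgetting the start of the chat.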

Try it yourself

Token breakdown:

7 tokens · 23 characters · ~3.3 chars/token
The | ▁cat | ▁sat | ▁on | ▁the | ▁mat | .

Hover a token to see its ID. Each token maps to a unique number that the AI uses internally.
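The token-to-ID mapping is just a lookup table in both directions. A minimal sketch (the ID numbers below are made up for illustration; every real model ships its own fixed vocabulary file):

```python
# Sketch of token <-> ID lookup. The IDs are invented for this
# example; real models define theirs in a vocabulary file.
TOKEN_TO_ID = {"The": 464, "▁cat": 2368, "▁sat": 7731, "▁on": 319,
               "▁the": 262, "▁mat": 2603, ".": 13}
ID_TO_TOKEN = {i: t for t, i in TOKEN_TO_ID.items()}

def encode(tokens):
    """Turn token strings into the numbers the model actually sees."""
    return [TOKEN_TO_ID[t] for t in tokens]

def decode(ids):
    """Turn ID numbers back into readable token strings."""
    return [ID_TO_TOKEN[i] for i in ids]

ids = encode(["The", "▁cat", "▁sat", "▁on", "▁the", "▁mat", "."])
print(ids)          # [464, 2368, 7731, 319, 262, 2603, 13]
print(decode(ids))  # round-trips back to the original tokens
```

The model never sees letters at all, only these ID numbers, which is why two spellings of the same word can look completely unrelated to it.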

Token facts

📏
~100,000
Tokens in GPT-4's vocabulary
Modern AI models use a vocabulary of roughly 100,000 possible tokens. Because tokens combine to form longer words, that's enough to cover all ~170,000 words in English, plus punctuation, numbers, and other languages.
📖
128,000
Token context window (GPT-4 Turbo)
That's roughly 96,000 words, about the length of a full novel, that the AI can read at once.
⚡
~0.75
Words per token (average)
On average, every token is about three-quarters of a word. A 1,000-word essay is roughly 1,333 tokens.
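The conversion both ways is simple arithmetic on that 0.75 average (the figure is the rough rule of thumb from above, not an exact property of any tokenizer):

```python
# Rough conversions using the ~0.75 words-per-token average.
WORDS_PER_TOKEN = 0.75

def words_to_tokens(words):
    """Estimate how many tokens a piece of writing will cost."""
    return round(words / WORDS_PER_TOKEN)

def tokens_to_words(tokens):
    """Estimate how much text fits in a given token budget."""
    return round(tokens * WORDS_PER_TOKEN)

print(words_to_tokens(1000))     # 1333 (the essay from the text)
print(tokens_to_words(128_000))  # 96000 (a 128K context window)
```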
🌍
Varies
Tokens per language
English text is efficient: about 4 characters per token. Some languages need 2–3× as many tokens for the same meaning, making AI more expensive to run in those languages.
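One reason for the gap is visible without any tokenizer at all: many models fall back to raw UTF-8 bytes for scripts that are rare in their training data, and non-Latin scripts already need more bytes per character. A small illustration (the Greek sentence is an approximate translation chosen for this example):

```python
# Same meaning, different byte cost. Non-Latin scripts use 2+ bytes
# per character in UTF-8, so byte-level tokenizers see more input.
english = "The cat sat on the mat."
greek = "Η γάτα κάθισε στο χαλί."  # approximate Greek equivalent

for text in (english, greek):
    chars = len(text)
    utf8_bytes = len(text.encode("utf-8"))
    print(f"{chars} chars -> {utf8_bytes} UTF-8 bytes")
```

English is pure ASCII, so characters and bytes match one-to-one; the Greek version carries noticeably more bytes for the same sentence, and more bytes generally means more tokens.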

What you've learned