Module 1
Tokenisation
Before an AI can read your words, it has to break them up into small pieces called tokens. Let's see how that works.
What is a token?
A token is a chunk of text: usually a word, part of a word, or a punctuation mark. AI models don't read letter by letter or word by word. They read in tokens.
For example, the word "unhappiness" might become three tokens: "un" + "happi" + "ness". Shorter, common words usually stay as one token.
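A real tokenizer learns its pieces from data (for example via byte-pair encoding), but the splitting idea can be sketched as a greedy longest-match over a vocabulary. The tiny vocabulary below is invented for this illustration; real models use ~100,000 learned pieces:

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match subword split (toy illustration, not a real BPE)."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining piece first, shrinking until one is in the vocab.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(word[i])
            i += 1
    return tokens

# Toy vocabulary chosen for this example.
vocab = {"un", "happi", "ness", "happy", "the", "cat"}
print(subword_tokenize("unhappiness", vocab))  # ['un', 'happi', 'ness']
```

Notice that the common word "happy" is a single piece in this toy vocabulary, while the rarer "unhappiness" gets split — the same pattern the text above describes.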
💡 Why does it matter? AI models have a limit on how many tokens they can process at once; this is called the context window. Understanding tokens helps you understand why AI sometimes "forgets" earlier parts of a long conversation.
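That "forgetting" can be sketched in a few lines: when a conversation exceeds the token budget, the oldest turns are the ones that fall outside the window. The budget and the chars-per-token rule below are made-up numbers for illustration:

```python
# Sketch of why long chats "forget": keep only the newest turns that fit in a
# token budget. The budget and the chars-per-token estimate are illustrative.
def trim_to_window(turns, max_tokens, count_tokens):
    kept, total = [], 0
    for turn in reversed(turns):       # walk from newest to oldest
        cost = count_tokens(turn)
        if total + cost > max_tokens:  # this turn would overflow the window...
            break                      # ...so it and everything older is dropped
        kept.append(turn)
        total += cost
    return list(reversed(kept))        # restore chronological order

estimate = lambda text: max(1, len(text) // 4)  # crude ~4 chars/token estimate
chat = ["a very long opening message " * 40, "second message", "latest question"]
print(trim_to_window(chat, max_tokens=20, count_tokens=estimate))
# ['second message', 'latest question'] -- the long opening message fell out
```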
Try it yourself
Token breakdown:
"The cat sat on the mat." → 7 tokens, 23 characters, ~3.3 chars/token
["The", " cat", " sat", " on", " the", " mat", "."]
(Each word after the first keeps its leading space as part of the token; the full stop is a token of its own.)
Each token maps to a unique ID number that the AI uses internally.
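The token-to-number mapping is just a lookup table. The ID numbers below are hypothetical, picked for this example; each real model's vocabulary assigns its own numbering:

```python
# Each token string gets a unique integer ID. These IDs are hypothetical,
# for illustration only; a real tokenizer has its own numbering.
vocab = {"The": 464, " cat": 2415, " sat": 7731, " on": 319,
         " the": 262, " mat": 2603, ".": 13}
id_to_token = {i: t for t, i in vocab.items()}

tokens = ["The", " cat", " sat", " on", " the", " mat", "."]
ids = [vocab[t] for t in tokens]                 # encode: text pieces -> numbers
decoded = "".join(id_to_token[i] for i in ids)   # decode: numbers -> text

print(ids)
print(decoded)  # The cat sat on the mat.
```

Because the mapping works in both directions, the model can consume numbers as input and emit numbers as output, which are then decoded back into text.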
Token facts
~100,000
Tokens in GPT-4's vocabulary
Modern AI models have a vocabulary of roughly 100,000 possible tokens. That is fewer than the ~170,000 words in English, but because tokens are subwords, they can combine to spell any word.
128,000
Token context window (GPT-4o)
That's roughly 96,000 words: about the length of a full novel that the AI can read at once.
~0.75
Words per token (average)
On average, every token is about three-quarters of a word, so a 1,000-word essay is roughly 1,333 tokens (1,000 ÷ 0.75).
Varies
Tokens per language
English text is efficient: roughly 4 characters per token. Some languages use 2–3× more tokens for the same meaning, making AI more expensive to run in those languages.
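The rules of thumb above can be turned into a quick back-of-envelope estimator. These are heuristics only; exact counts always depend on the specific tokenizer:

```python
# Back-of-envelope token estimates from the rules of thumb above.
# Heuristics only: exact counts depend on the tokenizer and the language.
def tokens_from_chars(text):
    """~4 characters per token for typical English text."""
    return max(1, round(len(text) / 4))

def tokens_from_words(word_count):
    """~0.75 words per token, i.e. tokens ~ words / 0.75."""
    return round(word_count / 0.75)

print(tokens_from_words(1000))  # a 1,000-word essay -> roughly 1333 tokens
```

Note that the estimate will rarely match an exact count — the demo sentence above really uses 7 tokens, while the character heuristic predicts 6 — but it is close enough for budgeting against a context window.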
What you've learned
- AI models split text into tokens: not letters, not always whole words
- Common words are usually one token; rare or long words get split up
- Every token becomes a number that the AI can process mathematically
- The total number of tokens determines how much the AI can read at once
Module 1 of 6 · Next: Module 2, Embeddings →