Module 3
Attention
When you read a sentence, your brain automatically focuses on the most relevant words to understand meaning. AI language models do something similar, using a mechanism called attention to decide which words matter most in context.
What is the attention mechanism?
Consider the sentence: "I sat by the river bank." Does "bank" mean a financial institution, or the side of a river? To know, the AI must look at the other words, especially "river", and decide which ones matter most.
This is exactly what attention does. It lets each word in a sentence look at all the other words and assign each one a weight: a score representing how relevant that word is to understanding its meaning in this particular context.
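To make the idea of weights concrete, here is a minimal sketch in Python. The relevance scores are made up for illustration, and turning them into weights with a softmax function is an assumption (it is the usual choice in transformer models, but the text above does not name it).

```python
import math

def softmax(scores):
    """Turn raw relevance scores into positive weights that sum to 1."""
    largest = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

words = ["I", "sat", "by", "the", "river", "bank"]

# Hypothetical scores for how relevant each word is to "bank".
# These numbers are invented for this example, not taken from a real model.
raw_scores = [0.1, 0.5, 0.2, 0.1, 2.0, 1.0]

weights = softmax(raw_scores)
for word, weight in zip(words, weights):
    print(f"{word:>5}: {weight:.2f}")

# The word with the largest weight has the most influence on "bank".
most_relevant = words[weights.index(max(weights))]
print("Most relevant word:", most_relevant)
```

With these invented scores, "river" receives the largest weight, which is what lets the model lean towards the "side of a river" meaning.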
Queries, Keys, and Values
Under the hood, each word is transformed into three vectors:
- Query: What is this word looking for? ("What context do I need?")
- Key: What does this word offer to others? ("What information can I provide?")
- Value: What information gets passed on if selected? ("My actual content")
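The attention score between two words is calculated by comparing the Query of one word against the Key of the other. Each word's Query is compared against every Key in the sentence, and the resulting scores decide how much of each word's Value gets passed on. Higher scores mean more influence.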
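The comparison described above can be sketched in a few lines of Python. This assumes the standard "scaled dot-product" form of attention, where a Query and a Key are compared by multiplying matching entries and adding them up. The tiny 2-number vectors below are hand-picked for illustration; in a real model they are produced by learned transformations of each word.

```python
import math

def dot(a, b):
    """Compare two vectors: larger results mean a closer match."""
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Score one Query against every Key, then blend the Values."""
    scale = math.sqrt(len(query))  # standard scaling by vector length
    scores = [dot(query, key) / scale for key in keys]
    weights = softmax(scores)
    size = len(values[0])
    # Weighted sum: each Value contributes in proportion to its weight.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(size)]
    return weights, output

# Hypothetical vectors, invented for this example.
words = ["river", "bank", "the"]
keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 0.1]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
bank_query = [2.0, 0.0]  # what "bank" is looking for

weights, output = attend(bank_query, keys, values)
for word, weight in zip(words, weights):
    print(f"{word:>5}: {weight:.2f}")
print("Blended output for 'bank':", [round(x, 2) for x in output])
```

Because the Query for "bank" matches the Key for "river" most closely, the Value for "river" contributes most to the blended output.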
Did you know? GPT-3, for example, uses 96 attention heads in each of its 96 transformer layers. Each head can learn to notice different linguistic patterns: some track pronouns, some track verbs and their objects, some detect sentiment. Together they form a rich, multi-dimensional understanding of context.
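As a rough sketch of how several heads can work side by side, the Python below splits each vector into equal slices and lets each head run attention on its own slice. This is a simplification under stated assumptions: real models also apply learned transformations for each head and after joining the results, which are left out here. The function names and numbers are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-head attention: score the Query against each Key, blend the Values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

def split_heads(vector, num_heads):
    """Cut one vector into num_heads equal slices, one per head."""
    size = len(vector) // num_heads
    return [vector[h * size:(h + 1) * size] for h in range(num_heads)]

def multi_head_attend(query, keys, values, num_heads):
    """Run attention separately on each slice, then join the results."""
    q_heads = split_heads(query, num_heads)
    k_heads = [split_heads(k, num_heads) for k in keys]
    v_heads = [split_heads(v, num_heads) for v in values]
    output = []
    for h in range(num_heads):
        head_keys = [k[h] for k in k_heads]
        head_values = [v[h] for v in v_heads]
        output.extend(attend(q_heads[h], head_keys, head_values))
    return output

# Invented 4-number vectors for a two-word "sentence", split across 2 heads.
query = [1.0, 0.0, 0.0, 1.0]
keys = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
values = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]

result = multi_head_attend(query, keys, values, num_heads=2)
print([round(x, 2) for x in result])
```

In this toy setup the first head leans towards the first word and the second head leans towards the second word, showing how different heads can focus on different things at the same time.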
What you've learned
- Attention allows each word to consider all other words when building its meaning.
- Words are assigned attention weights; higher weights mean more influence on understanding.
- Queries, Keys, and Values are the mathematical tools that compute these weights.
- Multi-head attention runs multiple attention patterns simultaneously for richer understanding.