Transformer Architecture

The transformer is the engine inside most modern AI language models. It combines tokenisation, embeddings, attention, and prediction into one pipeline, and it has been reshaping the field since 2017.

🧠 Foundation: What is a Neural Network?

Before we explore the transformer, we need to understand what it's built on. A transformer is a type of neural network — so let's start there.

What is a Neural Network?

Imagine your brain is made of tiny decision-makers called neurons. A neural network in AI works a bit like that — it is a chain of simple decision-making nodes organised into layers. Each node takes in information, processes it, and passes its result to the next layer.

These nodes are connected by weights — numbers that control how strongly each connection influences the next node. It is like a complex game of telephone where each person makes a small adjustment to the message before passing it on.
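As a rough illustration of nodes, layers, and weights, here is a minimal sketch of a two-layer network in plain Python. The weight values are made up by hand, and the sigmoid function used to squash each node's result is an assumption chosen for simplicity; neither comes from a real trained model.

```python
import math

def sigmoid(x: float) -> float:
    """Squash any number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights):
    """One layer: each node takes a weighted sum of all its inputs, then applies sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)))
        for node_weights in weights
    ]

# Hypothetical weights, picked by hand for illustration only.
hidden_weights = [[0.5, -0.2], [0.8, 0.3]]  # 2 inputs -> 2 hidden nodes
output_weights = [[1.0, -1.0]]              # 2 hidden nodes -> 1 output node

inputs = [1.0, 0.0]
hidden = layer(inputs, hidden_weights)   # the first layer processes the inputs
output = layer(hidden, output_weights)   # its result is passed on to the next layer
print(hidden, output)
```

Changing any single weight changes how strongly one node influences the next, which is exactly what training adjusts.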

How Networks Learn

Neural networks learn by looking at lots and lots of examples, called training data. At first they make many mistakes — like a new student trying to solve a puzzle. But after each attempt, the network figures out how far off its answer was.

It then carefully adjusts the weights between its nodes, trying to get a little bit closer to the right answer next time. This process repeats thousands or even millions of times. Over time the network gets better and better — much like practising a skill until you become an expert.

💡 Think of teaching a computer to recognise a cat: you show it millions of pictures of cats (and non-cats!), and it slowly learns what features make something cat-like by adjusting its internal weights.
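The adjust-and-repeat loop described above can be sketched with a single weight. This is a simplified example, assuming a toy rule (y = 3 × x), a squared-error measure of "how far off", and a hand-picked learning rate; real networks apply the same idea to millions or billions of weights at once.

```python
# Toy training data: the rule to be learned is y = 3 * x.
training_data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]

weight = 0.0          # start with a bad guess
learning_rate = 0.05  # size of each adjustment (assumed value)

for step in range(200):
    for x, target in training_data:
        prediction = weight * x
        error = prediction - target         # how far off the answer was
        gradient = 2 * error * x            # slope of the squared error with respect to the weight
        weight -= learning_rate * gradient  # nudge the weight to shrink the error

print(round(weight, 3))
```

After enough passes over the examples the weight settles very close to 3, the value that makes every prediction match its target.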

From Neural Network to Transformer

Many of today's most capable AI systems are neural networks, but not all neural networks are alike. A transformer is a type of neural network designed for understanding and generating language. It is built around mechanisms such as attention, which let it process all parts of a sentence at once.

The transformer architecture powers the best-known large language models today, including Google Gemini, Anthropic Claude, and OpenAI's GPT models. These models can have billions of weights, which is a large part of what makes them such capable language machines.

🏗️ Now let's look inside the transformer

You now know what a neural network is and how it learns. A transformer takes this further — adding the attention mechanism from Module 3 and stacking many layers to build something that can understand and generate human language. Keep reading to see how it all fits together.

What is a transformer?

A transformer is a type of neural network architecture that processes sequences of text by combining four key steps into one pipeline:

1. Tokenisation: Text is broken into tokens (small chunks).
2. Embeddings: Tokens become high-dimensional vectors capturing meaning.
3. Attention: Each token attends to all others, building contextual representations.
4. Prediction: The final representation is used to predict the next token.

These steps repeat for every new token generated — a transformer produces text one token at a time, feeding each output back as input.
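The four steps can be sketched end to end in a few lines of plain Python. This is a toy, not how any production model is built: the five-word vocabulary, the whitespace tokeniser, the 4-dimensional random embeddings, and the function names are all assumptions made for illustration. The weights are untrained, so the predicted token is meaningless; the point is only the shape of the pipeline.

```python
import math
import random

random.seed(0)  # fixed seed so the random toy weights are reproducible

vocab = ["the", "cat", "sat", "on", "mat"]       # hypothetical tiny vocabulary
token_id = {word: i for i, word in enumerate(vocab)}
dim = 4                                          # real models use hundreds or thousands

# Untrained, random embedding table: one vector per vocabulary entry.
embeddings = [[random.uniform(-1, 1) for _ in range(dim)] for _ in vocab]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def tokenise(text):
    """Step 1: split on spaces (real tokenisers use subword pieces)."""
    return [token_id[word] for word in text.lower().split()]

def embed(ids):
    """Step 2: look up one vector per token."""
    return [embeddings[i] for i in ids]

def attend(vectors):
    """Step 3: each position becomes a weighted average of all positions."""
    mixed = []
    for query in vectors:
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in vectors])
        mixed.append([sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)])
    return mixed

def predict(vector):
    """Step 4: score every vocabulary entry and return probabilities."""
    return softmax([dot(vector, e) for e in embeddings])

ids = tokenise("the cat sat")
context = attend(embed(ids))
probs = predict(context[-1])             # use the last position to guess what comes next
next_token = vocab[probs.index(max(probs))]
print(ids, next_token)
```

A real transformer adds learned projection matrices inside attention and repeats the attention step across many layers, but the flow of data from text to token IDs to vectors to a probability for each possible next token is the same.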

Encoder vs Decoder

The original transformer (2017) had two parts:

Encoder

Reads the full input sequence and builds a rich internal representation. Used for tasks like translation (understanding the source language) or text classification.

Example: BERT

Decoder

Generates new text one token at a time. Each new token is conditioned on all previous tokens. GPT models are decoder-only — they generate without a separate encoder.

Example: GPT-4, Claude, Gemini

Most modern AI chatbots are decoder-only transformers — they simply predict the next token, over and over, until a complete response is formed.
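The "predict the next token, over and over" loop looks roughly like this. The `predict_next` function and its lookup table are hypothetical stand-ins for a trained decoder, used here only so the loop can run; the `<end>` marker is likewise an assumed stop signal.

```python
END = "<end>"  # assumed stop marker

# Hypothetical stand-in for a trained model: a fixed lookup table.
TOY_MODEL = {
    "the": "cat",
    "cat": "sat",
    "sat": "down",
    "down": END,
}

def predict_next(tokens):
    """Receives every token so far and returns one next token.
    This toy version only looks at the last token; a real decoder uses the whole sequence."""
    return TOY_MODEL.get(tokens[-1], END)

def generate(prompt, max_new_tokens=10):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = predict_next(tokens)
        if token == END:
            break                  # the response is complete
        tokens.append(token)       # feed the output back in as input
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat', 'down']
```

The `max_new_tokens` limit is a common safeguard so that generation stops even if the model never produces a stop marker.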

Why layers matter

A transformer stacks its attention and feed-forward operations into many layers. Each layer's output becomes the next layer's input — building progressively richer representations:

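→ Early layers (e.g. 1–4): Basic grammar, word order, punctuation
→ Middle layers (e.g. 5–8): Word meanings, entity recognition, coreference
→ Late layers (e.g. 9+): Reasoning, inference, tone, and task-specific patterns

The layer numbers are illustrative (think of a 12-layer model), and real models do not divide the work this neatly. As a rule of thumb, though, depth brings abstraction: more layers let the model capture more complex patterns, which is one reason deeper models tend to be more capable.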
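To make "each layer's output becomes the next layer's input" concrete, here is a minimal sketch of a stack of attention + feed-forward blocks. It is heavily simplified and rests on several assumptions: there are no learned weights, every layer reuses the same toy functions, `tanh` stands in for the feed-forward network, and the residual connections and normalisation used in real transformers are left out.

```python
import math

def softmax(scores):
    biggest = max(scores)
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(vectors):
    """Mix information across positions: each output is a weighted average of all inputs."""
    dim = len(vectors[0])
    mixed = []
    for query in vectors:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in vectors]
        weights = softmax(scores)
        mixed.append([sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)])
    return mixed

def feed_forward(vectors):
    """Process each position on its own (toy nonlinearity in place of learned weights)."""
    return [[math.tanh(x) for x in v] for v in vectors]

def block(vectors):
    """One layer: attention followed by feed-forward."""
    return feed_forward(attention(vectors))

def run_stack(vectors, num_layers):
    for _ in range(num_layers):
        vectors = block(vectors)   # this layer's output is the next layer's input
    return vectors

tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # three made-up token vectors
print(run_stack(tokens, num_layers=12))
```

In a trained model each of the N layers has its own weights, which is what lets different depths specialise in different kinds of patterns.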

[Diagram: the transformer as a stack of blocks. The Attention + Feed Forward block repeats N times (e.g. 12 or 96 layers).]

🏗️ Did you know? The original transformer was introduced in a 2017 paper called “Attention Is All You Need”. Before transformers, language models struggled with long sequences of text. Transformers addressed this by letting every word look at every other word at once, rather than reading left to right one word at a time. This parallelism made transformers dramatically faster to train and far better at using context.

What you've learned

✓ A transformer combines tokenisation, embeddings, attention, and prediction into one pipeline.
✓ Information passes through many layers, each one refining the model's understanding.
✓ Modern large language models spread billions of parameters across these layers (GPT-3, for example, has 175 billion).
✓ The transformer architecture, introduced in 2017, powers almost every modern AI language model.
