Why Transformers Changed Everything
In 2017, Google researchers published "Attention Is All You Need" — arguably the most influential AI paper of the decade. The transformer architecture it introduced replaced RNNs and LSTMs as the dominant approach to sequence processing, and it powers virtually every modern AI system: GPT-4, Claude, Gemini, DALL-E, Stable Diffusion, and more.
The key innovation? Self-attention — a mechanism that lets the model consider all positions in a sequence simultaneously, rather than processing tokens one at a time like RNNs.
The Problem with RNNs
Before transformers, RNNs processed text sequentially — one word at a time. This had two major problems:
- Vanishing gradients — information from early tokens gets diluted through many sequential steps, making it hard to learn long-range dependencies
- Sequential processing — each step depends on the previous one, making training slow and impossible to parallelize across GPUs
LSTMs and GRUs partially addressed the vanishing gradient problem but couldn't fix the parallelization issue. Transformers solve both elegantly.
Self-Attention: The Core Mechanism
Self-attention lets each token in a sequence "attend to" every other token, computing how relevant each one is. For example, in "The cat sat on the mat because it was tired," self-attention helps the model understand that "it" refers to "cat" — even though they're separated by several words.
How it works:
- Each token is transformed into three vectors: Query (Q), Key (K), and Value (V) via learned linear projections
- Attention scores are computed: score = Q · Kᵀ / √d_k (dot product of query with all keys, scaled)
- Scores are passed through softmax to get attention weights (probabilities that sum to 1)
- The output is the weighted sum of values: Attention(Q,K,V) = softmax(QKᵀ/√d_k)V
The scaling factor √d_k prevents the dot products from getting too large, which would push softmax into regions with extremely small gradients.
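The four steps above can be sketched in a few lines of NumPy. This is a toy illustration with random vectors standing in for learned projections, not a training-ready implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_len, seq_len) relevance scores
    scores -= scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights                     # weighted sum of values

# Toy example: 3 tokens, d_k = 4 (values are random stand-ins)
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
```

Each row of `w` is one token's attention distribution over the whole sequence, which is exactly the "how relevant is every other token to me" computation described above.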
Multi-Head Attention
Instead of computing attention once, transformers use multiple "heads" — parallel attention computations with different learned projections. Each head can focus on different types of relationships:
- One head might learn syntactic relationships (subject-verb agreement)
- Another might learn semantic relationships (synonyms, co-references)
- Another might focus on positional patterns
The outputs from all heads are concatenated and projected back to the model dimension. GPT-3 uses 96 attention heads; GPT-4 likely uses even more.
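The split-compute-concatenate pattern can be sketched as follows. The weight matrices here are small random stand-ins for the learned projections (an assumption for illustration); in a real model they are trained parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (seq_len, d_model); each W: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # Split each projection into heads: (num_heads, seq_len, d_head)
    split = lambda M: M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    heads = softmax(scores) @ Vh                    # each head attends independently
    # Concatenate heads back to (seq_len, d_model), then apply output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(1)
d_model, seq_len = 8, 5
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))
X = rng.normal(size=(seq_len, d_model))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2)
```

Note that the per-head dimension is `d_model / num_heads`, so adding heads does not increase the total computation; it partitions it.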
Positional Encoding
Since self-attention processes all tokens simultaneously (unlike RNNs), the model has no inherent notion of word order. Positional encodings are added to the input embeddings to inject position information.
The original paper used sinusoidal functions: PE(pos, 2i) = sin(pos/10000^(2i/d_model)) for even dimensions and PE(pos, 2i+1) = cos(pos/10000^(2i/d_model)) for odd ones. Modern models often use learned positional embeddings or Rotary Position Embeddings (RoPE) for better extrapolation to longer sequences.
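A minimal sketch of the original sinusoidal scheme, interleaving sine and cosine across the embedding dimensions:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos of the same angle."""
    pos = np.arange(max_len)[:, None]           # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]        # (1, d_model/2) pair index
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # even dimensions
    pe[:, 1::2] = np.cos(angles)                # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
```

Each dimension pair oscillates at a different frequency, so every position gets a unique pattern, and the encoding for `pos + k` is a fixed linear function of the one for `pos`, which is what makes relative offsets easy for attention to pick up.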
The Full Transformer Architecture
Encoder (used in BERT-style models)
Each encoder layer contains: Multi-Head Self-Attention → Add & LayerNorm → Feed-Forward Network → Add & LayerNorm. The "Add" refers to residual connections — the input to each sublayer is added to its output, which keeps gradients flowing through deep stacks and mitigates vanishing gradients.
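The sublayer wiring can be sketched with the attention and feed-forward parts abstracted away (identity functions stand in for them here, purely as an assumption for illustration):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_layer(x, attention, ffn):
    """One post-LN encoder layer: sublayer -> residual add -> LayerNorm, twice."""
    x = layer_norm(x + attention(x))  # Add & LayerNorm around self-attention
    x = layer_norm(x + ffn(x))        # Add & LayerNorm around feed-forward
    return x

# Identity stand-ins for the two sublayers (hypothetical, for shape/flow only)
x = np.random.default_rng(4).normal(size=(5, 8))
out = encoder_layer(x, attention=lambda t: t, ffn=lambda t: t)
```

This mirrors the original paper's "post-LN" ordering; many modern models instead normalize before each sublayer ("pre-LN"), which trains more stably at depth, but the residual-add pattern is the same.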
Decoder (used in GPT-style models)
Similar to encoder but with masked self-attention — future tokens are hidden during training, forcing the model to predict the next token based only on previous ones. This is why GPT generates text left-to-right.
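The masking itself is simple: before the softmax, every score for a future position is set to negative infinity, so its attention weight becomes exactly zero. A minimal sketch:

```python
import numpy as np

seq_len = 4
# Causal mask: True wherever column j > row i (a future position)
mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)

scores = np.random.default_rng(2).normal(size=(seq_len, seq_len))
scores[mask] = -np.inf  # future positions get -inf before softmax

e = np.exp(scores - scores.max(axis=-1, keepdims=True))  # exp(-inf) -> 0
weights = e / e.sum(axis=-1, keepdims=True)
```

After the softmax, row `i` of `weights` is nonzero only for columns `0..i`: token 0 can attend only to itself, the last token to everything before it, which is what lets a decoder be trained on all positions at once while still generating left-to-right.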
Feed-Forward Network
After attention, each position independently passes through a two-layer fully connected network — expanding to an inner dimension (commonly 4× d_model), applying a non-linear activation such as GELU, then projecting back. This is where much of the model's "knowledge" is stored — the weights encode factual information learned during pre-training.
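A sketch of the position-wise FFN with random stand-in weights (the 4× expansion ratio follows the original paper; the GELU here uses the common tanh approximation):

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU, as used in GPT-style models."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: expand to d_ff, apply GELU, project back to d_model."""
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(3)
d_model, d_ff, seq_len = 8, 32, 5   # d_ff = 4 * d_model, as in the original paper
W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)
out = feed_forward(rng.normal(size=(seq_len, d_model)), W1, b1, W2, b2)
```

Because the same weights are applied to every position independently, the FFN adds no interaction between tokens — all mixing across the sequence happens in the attention sublayers.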
Why Transformers Scale So Well
- Parallelizable — all positions processed simultaneously during training (unlike sequential RNNs)
- Scaling laws — performance improves predictably with more parameters, data, and compute
- Transfer learning — pre-trained transformers can be fine-tuned for specific tasks with minimal data
- Flexible architecture — same core design works for text, images (ViT), audio (Whisper), and multimodal inputs
Understand transformers hands-on with our Transformer Architecture Deep Dive lesson, featuring interactive quizzes and visual explanations. Our Large Language Models lesson shows how transformers are trained at scale. Get full access to all 31 lessons to master the architecture behind modern AI.