Quick answer: What is a transformer in AI? A transformer is a neural network architecture that processes sequences of data (text, images, audio, or video) by letting every element attend directly to every other element at once, rather than processing them one by one. Introduced in the 2017 paper "Attention Is All You Need" (Vaswani et al., Google Brain), it underpins nearly every major AI system today: GPT-4, Claude, Gemini, Llama, DALL-E, Stable Diffusion, Whisper, AlphaFold, and GitHub Copilot all build on the transformer or on its attention mechanism.
In 2017, eight researchers at Google Brain published a twelve-page paper with a confident title: "Attention Is All You Need." They were trying to improve machine translation. What they accidentally produced was the foundation of modern artificial intelligence.
Every AI product that matters right now runs on a descendant of that paper. GPT-4, Claude, Gemini, Llama, DALL-E 3, Stable Diffusion, GitHub Copilot, Whisper, AlphaFold 2: all are built on transformers or on the attention mechanism at their core. The large language models in that list are not merely inspired by the transformer. They are essentially the same architecture, scaled up.
So what is in those twelve pages? What did those eight researchers discover that no one had seen before? And why does the same mechanism work equally well for language, images, protein structures, weather forecasting, and drug discovery?
This guide answers all of that — from the problem the transformer was designed to solve, through every component of its architecture, to how it learns, how it generates output, and where it still falls short. No machine learning degree required. Any mathematics that appears will be explained in plain English before the symbol appears.
The Problem Transformers Solved — and Why Everything Before Failed
To understand why transformers matter, you need to understand what they replaced — and specifically, why the replacement felt so inevitable in retrospect.
The sequence problem
Language, music, video, and biological sequences all share a critical property: order matters. "The dog bit the man" and "The man bit the dog" are identical words in a different arrangement with opposite meanings. Any model that processes these inputs must track the relationships between elements across potentially very long sequences.
By the mid-2010s, the dominant approach was the Recurrent Neural Network (RNN). An RNN processes a sequence one element at a time — left to right — carrying a "hidden state" forward at each step: a compressed, fixed-size representation of everything the model has seen so far.
Analogy: reading a novel while only allowed to keep a single Post-it note of running observations. By chapter twenty, your note is so overwritten that chapter one is effectively forgotten.
Why RNNs failed: two architectural walls
RNNs hit two fundamental ceilings that no amount of engineering could fully overcome:
The vanishing gradient problem: training a neural network requires propagating error signals backward through all the steps it processed. In an RNN processing a 500-word document, the gradient signal from step 500 must travel backward through 499 multiplication operations before it reaches step 1. Each multiplication attenuates it. By the time it arrives, the signal is near-zero — the model cannot learn that words at the start of a long document are related to words at the end. Long-range dependencies become impossible to capture reliably.
Sequential processing: each RNN step depends on the previous hidden state. You cannot compute step 5 without first computing steps 1 through 4. Modern GPU hardware is built for massive parallelism — thousands of operations simultaneously. Sequential dependencies make RNNs fundamentally unable to exploit this. Training a large RNN on a billion tokens meant a billion sequential operations per pass. Brutally slow.
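The vanishing-gradient arithmetic is easy to check numerically. The sketch below is a simplification: it stands in a single scalar for the per-step multiplication, and the factor of 0.9 is an assumed illustrative value, not a measurement from any real RNN.

```python
def backpropagated_signal(per_step_factor: float, steps: int) -> float:
    """Scale of a gradient after passing backward through `steps` multiplications."""
    signal = 1.0
    for _ in range(steps):
        signal *= per_step_factor
    return signal

# 499 backward steps, as in the 500-word example; 0.9 is an assumed factor.
short_range = backpropagated_signal(0.9, 10)
long_range = backpropagated_signal(0.9, 499)
print(f"after 10 steps:  {short_range:.3f}")
print(f"after 499 steps: {long_range:.3e}")
```

Note that the loop itself is sequential: each iteration needs the previous result, which is the same dependency that keeps an RNN from using a GPU's parallelism.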
LSTMs and GRUs: patching the leaks
Long Short-Term Memory networks (LSTMs), introduced by Hochreiter and Schmidhuber in 1997, added explicit gating mechanisms — "forget gates," "input gates," and "output gates" — that could learn to selectively preserve or discard information over long sequences. Gated Recurrent Units (GRUs) simplified this slightly with similar results.
These were genuine advances. LSTMs powered the best machine translation systems, speech recognition engines, and text prediction models of the mid-2010s. But they did not solve the parallelization problem, because the sequential dependency was architectural. And their memory management was a learned approximation, not exact retrieval. Long-range dependencies were handled better, not perfectly.
By 2017, the state of the art for neural machine translation was a stacked LSTM with an attention mechanism bolted on top — an attention mechanism that would soon become the entire model.
The insight: replace recurrence with pure attention
The Vaswani et al. (2017) paper asked a radical question: what if you discarded recurrence entirely? Instead of processing tokens sequentially and passing hidden states forward, what if you processed the entire sequence at once — and let every token directly attend to every other token simultaneously?
This approach requires more computation per forward pass. But those computations are all independent of each other. They can run in parallel on a GPU. The trade — sequential complexity for parallelizable complexity — turned out to be one of the best trades in the history of computing.
The 2017 paper at a glance
Title: "Attention Is All You Need"
Published: arXiv, June 12, 2017
Authors: Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin (Google Brain / Google Research)
Citations as of 2025: ~120,000
Tokenization — Turning the World Into Numbers
Before a transformer can process anything, the input must become numbers. The mechanism that does this conversion is called tokenization, and it is both simpler and more subtle than most explanations suggest.
What is a token?
A token is the atomic unit of input to a transformer. For text, tokens are typically word fragments produced by a sub-word tokenization algorithm such as Byte-Pair Encoding (BPE) or WordPiece. Some examples:
"transformer" might tokenize as ["transform", "er"]
"unbelievable" might become ["un", "believ", "able"]
Common short words like "the," "is," "of" are usually single tokens
Numbers, punctuation, and whitespace each have their own tokens
Why sub-word tokenization rather than whole words? Vocabulary size. If every word is a token, you need hundreds of thousands of token IDs (one per word) for a single language, and millions across all languages. Sub-word tokenization caps the vocabulary at a manageable size (GPT-4's tokenizer has a vocabulary of roughly 100,000 tokens) while still handling rare words, invented words, and foreign-language text by combining sub-word pieces.
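To make the idea concrete, here is a toy sub-word splitter. It is a sketch under loud assumptions: the vocabulary is hypothetical and hand-picked, and the greedy longest-match rule is a simplification in the spirit of WordPiece-style lookup, not the merge procedure a trained BPE tokenizer actually runs.

```python
def greedy_subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Split `word` into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No piece matched: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of pieces.
toy_vocab = {"transform", "er", "un", "believ", "able", "the"}
print(greedy_subword_tokenize("transformer", toy_vocab))
print(greedy_subword_tokenize("unbelievable", toy_vocab))
```

The single-character fallback is what lets a small vocabulary cover words it has never seen: anything can be spelled out piece by piece.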
For images, the Vision Transformer (ViT) approach divides an image into a grid of patches — say, 16×16 pixels each — flattens each patch into a vector, and treats these patch vectors as tokens. A 224×224 image divided into 16×16 patches produces 196 tokens. For audio, the waveform is first converted to a mel-spectrogram (a 2D representation of frequency over time), and spectrogram patches are treated as image patches — then tokens. Everything becomes a sequence of vectors.
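The patch arithmetic can be written out in a few lines. This is a minimal sketch that assumes square images, square patches, and three color channels (the channel count is an assumption; the text does not specify it).

```python
def vit_token_count(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2

def patch_vector_length(patch_size: int, channels: int = 3) -> int:
    """Length of one flattened patch, before any learned projection."""
    return patch_size * patch_size * channels

print(vit_token_count(224, 16))   # 14 * 14 = 196
print(patch_vector_length(16))    # 16 * 16 * 3 = 768
```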
Token embeddings: from ID to meaning
Once tokenized, each token receives an integer ID — "the" might be 262, "cat" might be 4758. But raw integers carry no mathematical meaning. 4758 is not 18 times more "cat" than 262 is "the."
The embedding layer solves this by mapping each integer ID to a dense vector — a list of hundreds or thousands of floating-point numbers. These vectors are not hand-crafted; they are parameters of the model, learned during training.
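Mechanically, an embedding layer is a lookup table. The sketch below uses a tiny 8-dimensional table with random values as a stand-in for learned parameters; the vocabulary size of 5,000 and the initialization scale are assumptions chosen only so the example IDs from the text fit.

```python
import random

def make_embedding_table(vocab_size: int, dim: int, seed: int = 0) -> list[list[float]]:
    """Randomly initialised table: one `dim`-length vector per token ID."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]

def embed(token_ids: list[int], table: list[list[float]]) -> list[list[float]]:
    """Look up each token ID's vector; the lookup is just row indexing."""
    return [table[token_id] for token_id in token_ids]

table = make_embedding_table(vocab_size=5000, dim=8)
vectors = embed([262, 4758], table)   # the example IDs for "the" and "cat"
print(len(vectors), len(vectors[0]))
```

Before training these rows are noise. Training nudges them until tokens used in similar contexts end up with similar vectors.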
The resulting embedding space has geometric structure that reflects semantic relationships. Similar concepts cluster nearby. The classic demonstration: the vector for "king" minus the vector for "man" plus the vector for "woman" approximately equals the vector for "queen." This geometry is not engineered — it emerges spontaneously from the statistical patterns in training data.
Typical embedding dimensions: 512 for the original transformer's base model, 768 for BERT-base, 1,600 for GPT-2 XL, 12,288 for the largest GPT-3 model. OpenAI has not published GPT-4's dimensions. Every token entering the transformer is a vector of this many dimensions.
Positional encoding: teaching order without recurrence
Here is a subtle problem: the transformer processes all tokens simultaneously. If you present it with the tokens ["cat," "the," "sat," "on," "mat," "the"] — shuffled — versus ["the," "cat," "sat," "on," "the," "mat"] — in order — the set of token embeddings is identical. The model has no way to know which order they came in.
The original solution in Vaswani et al. (2017): add a unique positional encoding vector to each token's embedding before it enters the transformer layers. Position 1 gets one specific vector added to its embedding. Position 2 gets a different vector. Position N gets the Nth positional vector. The transformer now sees both what each token is and where it sits in the sequence.
The original positional encoding used sine and cosine functions at different frequencies, producing a unique pattern for each position. The authors' stated motivation: because the sine/cosine pattern extends naturally to arbitrary positions, this formulation might let the model extrapolate to sequences longer than those seen in training. In practice that extrapolation has proved limited, which is part of why later schemes were developed.
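Here is a minimal sketch of the sinusoidal scheme from the 2017 paper, where even indices use sine and odd indices use cosine, with wavelengths set by the constant 10000. Positions are counted from 0 in this sketch, and the 8-dimensional zero embeddings are placeholders so the positional pattern is visible on its own.

```python
import math

def sinusoidal_position(pos: int, dim: int) -> list[float]:
    """Positional vector for one position: sin on even indices, cos on odd."""
    vec = []
    for i in range(dim):
        # Each (sin, cos) pair shares one frequency.
        pair_index = i // 2
        angle = pos / (10000 ** (2 * pair_index / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

def add_positions(embeddings: list[list[float]]) -> list[list[float]]:
    """Add the positional vector for index p to the p-th token embedding."""
    dim = len(embeddings[0])
    return [
        [e + p for e, p in zip(emb, sinusoidal_position(pos, dim))]
        for pos, emb in enumerate(embeddings)
    ]

toy_embeddings = [[0.0] * 8 for _ in range(4)]   # four tokens, zero vectors
encoded = add_positions(toy_embeddings)
print(encoded[0])   # position 0: sin terms are 0, cos terms are 1
```

Because the function takes any integer position, nothing stops you from evaluating it at positions beyond the training length. Whether the model handles those positions well is a separate question.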
Modern alternatives include Rotary Position Embedding (RoPE), used in LLaMA, GPT-NeoX, and many other models. RoPE encodes relative rather than absolute position — the model learns to ask "how far apart are these two tokens?" rather than "what is each token's absolute index?" This generalizes better to long sequences and has become the dominant approach in frontier models.
The Attention Mechanism — the Heart of the Transformer
Attention is the core operation of the transformer. Everything else — the feed-forward network, layer normalization, residual connections — supports and stabilizes it. Understanding attention deeply is the key to understanding why transformers are so powerful.
The intuition: a soft, learned lookup
Consider the sentence: "The animal didn't cross the street because it was too tired." What does "it" refer to? The animal? The street? A human reader resolves this instantly: "tired" implies a living creature, not a road. The word "it" attends to "animal" far more than to "street."
An RNN processing this sentence sequentially accumulates context into a single hidden state. By the time it reaches "tired," the signal from "animal" has been diluted through every intervening step. The resolution is unreliable.
Attention solves this directly: when the model processes "it," instead of relying on a degraded hidden state, it runs a direct comparison between "it" and every other token in the sequence — and computes a weighted mixture of all of them, with higher weights for more relevant tokens. "Animal" gets a high weight. "Street" gets a low one. The resulting representation of "it" is rich with contextual information about what it refers to.
Key insight: attention is a soft lookup. Instead of retrieving exactly one entry from memory (a hard lookup, like indexing into an array), you retrieve a weighted blend of all entries — the weights learned to reflect relevance.
Queries, keys, and values: the retrieval system
The attention mechanism uses three vectors for every token: a Query (Q), a Key (K), and a Value (V). These are computed by multiplying each token's embedding by three separate learned weight matrices — W_Q, W_K, and W_V.
Library analogy: the Query is the search term you type into a library catalogue. The Keys are the index tags attached to every book on the shelf. The Values are the actual content of those books. To find what you need, you compare your search term (Query) against every index tag (Key), compute a relevance score, then retrieve a blend of book contents (Values) weighted by those scores.
The computation for a single token i attending to the sequence proceeds as follows:
Step 1 (score): compute the dot product of token i's Query vector with the Key vector of every token in the sequence, including token i itself. A large positive dot product means the Query and Key point in similar directions in high-dimensional space, which the model reads as high relevance.
Step 2 (scale): divide all scores by the square root of the Key vector dimension (√d_k). Without this scaling, attention scores grow very large as the Key dimension increases, pushing the subsequent softmax into regions with near-zero gradients where training stalls.
Step 3 (softmax): convert raw scores to probabilities using the softmax function. All resulting attention weights are positive and sum to 1. This produces a probability distribution over all tokens, the "attention distribution" for token i.
Step 4 (weighted sum): multiply each token's Value vector by its attention weight, then sum all the weighted Values. The result is the new representation of token i: a blend of information from every token in the sequence, proportioned by relevance.
The attention formula
Attention(Q, K, V) = softmax( QKᵀ / √d_k ) · V
Where Q = query matrix, K = key matrix, V = value matrix, d_k = key dimension. The matrix form computes attention for all tokens simultaneously.
The crucial point about this formula: every token's Query is compared against every token's Key in one matrix multiplication. All Value aggregations are computed simultaneously. Every step is parallelizable. This is why transformers train so much faster than RNNs on modern hardware: work that an RNN spreads across N sequential steps, a transformer executes in a handful of matrix multiplications that run in parallel.
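The four steps translate almost line for line into code. The sketch below uses plain Python lists and explicit loops so each step stays visible; real implementations do the same arithmetic as batched matrix multiplications. The Q, K, and V values are made-up toy numbers, not outputs of any trained model.

```python
import math

def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Steps 1-2: dot product with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        # Step 3: softmax over the scores.
        weights = softmax(scores)
        # Step 4: weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy inputs (made-up numbers): three tokens, 2-dimensional Q/K/V.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, weights = attention(Q, K, V)
print([round(w, 3) for w in weights[0]])
```

Each pass through the outer loop is independent of the others, which is exactly what lets hardware compute all rows at once.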
Self-attention vs. cross-attention
There are two fundamental variants of attention in transformer architectures:
Self-attention: Q, K, and V are all computed from the same sequence. Each token attends to every other token in the same sequence. This is the mechanism used inside the encoder and inside the decoder's first attention layer. It is how the model builds rich contextual representations of the input.
Cross-attention: Q comes from one sequence (typically the decoder), while K and V come from a different sequence (typically the encoder output). This is how a translation model reads the source-language representation while generating target-language tokens — the decoder queries the encoder's knowledge at every generation step.
The attention matrix: what it reveals about language
For a sequence of N tokens, the attention computation produces an N×N matrix of weights. Row i shows how much token i attends to every other token. Column j shows which tokens most attend to token j.
This matrix is interpretable, and what trained models reveal in it is striking: pronouns attend to their antecedents ("it" attends strongly to "animal"). Verbs attend to their subjects and objects. Articles attend to the nouns they modify. The model has learned syntactic structure without being taught any grammar rules — it emerged from predicting text.
The N×N attention matrix is also the primary bottleneck of the transformer architecture. For N = 1,000 tokens, it contains one million weights. For N = 100,000 tokens, ten billion. Memory and compute scale quadratically with sequence length. This O(N²) scaling is the central engineering challenge of transformer research and a major driver of techniques like Flash Attention, Sparse Attention, and alternative architectures like Mamba.
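A quick back-of-the-envelope sketch of that quadratic growth. The 2 bytes per weight is an assumption (16-bit storage), and the figures describe a single attention matrix stored in full, for one head in one layer. Memory-efficient kernels avoid holding the whole matrix at once, so treat this as the naive upper bound rather than what a tuned system uses.

```python
def attention_matrix_entries(seq_len: int) -> int:
    """Number of weights in one N x N attention matrix."""
    return seq_len * seq_len

def attention_matrix_bytes(seq_len: int, bytes_per_weight: int = 2) -> int:
    """Memory for one matrix, assuming 2-byte (16-bit) weights."""
    return attention_matrix_entries(seq_len) * bytes_per_weight

for n in (1_000, 10_000, 100_000):
    entries = attention_matrix_entries(n)
    gib = attention_matrix_bytes(n) / 2**30
    print(f"N={n:>7,}: {entries:>15,} weights, about {gib:,.2f} GiB per head per layer")
```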
Multi-Head Attention — Looking From Many Angles at Once
The single-head limitation
A single attention head computes one set of Q/K/V projections and produces one weighted aggregation per token. This means the model can capture one type of relationship at a time — one "lens" on the data.
But language (and images, and audio) contain multiple simultaneous relationship types: syntactic relationships (subject-verb agreement), semantic relationships (coreference, synonymy), positional relationships (nearby tokens are often related), and structural relationships (clause boundaries, paragraph structure). One lens cannot hold all of these in focus simultaneously without trading off between them.
How multi-head attention works
Multi-head attention runs h independent attention operations in parallel, each with its own learned Q, K, V projection matrices. Each "head" therefore looks at the input through a different learned lens — projecting the token embeddings into different subspaces before computing attention.
Dimensionality management: to keep total compute similar to a single full-dimensional attention, the embedding dimension is divided by h before each head's projection. So if the embedding dimension is 512 and there are 8 heads, each head works in a 64-dimensional subspace. All heads run in parallel, their outputs are concatenated (restoring the 512-dimensional size), and a final linear projection blends the concatenated heads into a single representation.
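The dimension bookkeeping can be sketched on its own. This example shows only the split and the concatenation: in a real model each head's slice comes out of learned projection matrices and passes through attention before being concatenated, and both of those steps are omitted here.

```python
def split_heads(vector: list[float], num_heads: int) -> list[list[float]]:
    """Slice one token vector into `num_heads` equal chunks."""
    if len(vector) % num_heads != 0:
        raise ValueError("embedding dimension must be divisible by num_heads")
    head_dim = len(vector) // num_heads
    return [vector[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]

def concat_heads(heads: list[list[float]]) -> list[float]:
    """Join per-head outputs back into one full-width vector."""
    return [x for head in heads for x in head]

token = [float(i) for i in range(512)]       # placeholder 512-dimensional vector
heads = split_heads(token, num_heads=8)
print(len(heads), len(heads[0]))             # 8 heads of 64 dimensions each
print(len(concat_heads(heads)))              # back to 512
```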
In practice: GPT-2 small uses 12 attention heads. GPT-3 uses 96. Larger models have more heads partly because more heads mean more room for specialization, and partly because wider embeddings allow larger per-head dimensions without sacrificing breadth.
What individual heads actually learn
Research into the behavior of trained attention heads has revealed surprising specialization:
Positional heads: attend primarily to tokens a fixed relative distance away (e.g., the token immediately before or after), essentially implementing a local window of attention.
Syntactic heads: track grammatical dependencies — a verb head attending to its subject, an article head attending to its noun.
Coreference heads: link pronouns to their referents across long distances.
Rare-word heads: disproportionately attend to low-frequency, informationally dense tokens.
Delimiter heads: attend strongly to sentence boundaries and special tokens like [CLS] and [SEP].
Analogy: a panel of eight specialist reviewers reading the same document simultaneously — one scanning for factual claims, one tracking pronoun references, one monitoring sentence structure, one noting rare or unusual words. Their combined assessment is richer than any single reviewer's, even though each one reads the same text.
Notably: many attention heads can be pruned (removed) without measurable performance degradation. The model has redundancy built in — like an immune system with overlapping coverage. This redundancy makes large models robust to noise and partial damage.
The Full Transformer Layer — Attention Plus Everything Else
Attention is the star, but it needs a supporting cast. A complete transformer layer contains four components: multi-head attention plus three supporting pieces, each solving a specific problem that attention alone cannot.
The feed-forward network: processing what attention found
After multi-head attention aggregates contextual information from across the sequence, a position-wise feed-forward network (FFN) processes each token's representation independently — no cross-token interaction at this stage.
Structure: two linear transformations with a non-linear activation (typically GELU or ReLU) in between. The hidden dimension of the FFN is typically four times the embedding dimension. So a 1,024-dimensional model has a 4,096-dimensional FFN hidden layer. This expansion-then-contraction gives the model a wide intermediate space to perform complex transformations.
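A minimal sketch of that expand-then-contract structure, using ReLU and a toy width of 8 with a hidden width of 32 to keep the 4x ratio. The weights are random placeholders rather than trained values, and the initialization scale is an assumption.

```python
import random

def linear(x, weights, bias):
    """y = W x + b, with `weights` stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def feed_forward(x, w1, b1, w2, b2):
    """Expand to the hidden width, apply ReLU, project back down."""
    return linear(relu(linear(x, w1, b1)), w2, b2)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_hidden = 8, 32            # hidden width is 4x the model width
w1, b1 = random_matrix(d_hidden, d_model, rng), [0.0] * d_hidden
w2, b2 = random_matrix(d_model, d_hidden, rng), [0.0] * d_model

token = [0.5] * d_model
hidden = relu(linear(token, w1, b1))
output = feed_forward(token, w1, b1, w2, b2)
print(len(hidden), len(output))
```

The function takes one token vector at a time, which reflects the "position-wise" property: the same weights are applied to every position with no mixing between tokens.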
What the FFN actually stores: research by Geva et al. (2021) suggests FFN layers function as key-value memories. The first linear layer acts as a pattern detector — recognizing input patterns that match stored keys. The second linear layer retrieves the corresponding stored values. Factual knowledge — "the capital of France is Paris," "water consists of hydrogen and oxygen" — appears to be encoded in FFN weights, not in attention weights. Attention gathers context; the FFN applies knowledge.
Residual connections: highways for gradients
After each sub-layer (attention and FFN), the sub-layer's input is added back to its output before passing to the next component: Output = x + Sublayer(x). This is a residual (or skip) connection, and it solves a critical training problem.
In a deep network — GPT-3 has 96 layers — error gradients must travel from the final output loss all the way back to the first layer. Without skip connections, these gradients either vanish (become near-zero, preventing learning in early layers) or explode (become enormous, destabilizing training). Residual connections create direct gradient pathways that skip over layers entirely, ensuring that no layer is more than one step away from a gradient signal.
A secondary benefit: each layer only needs to learn a correction to the current representation, not a complete new representation from scratch. This is the "residual" in the name — the layer learns what to add or remove from an existing representation, not what the representation should be from zero.
Analogy: a relay race where the baton carries a live GPS signal back to the starting line at every handoff. Even if a middle runner fumbles, the starting position is never lost — the gradient always finds its way home.
Layer normalization: keeping activations stable
After each sub-layer plus residual connection, layer normalization rescales each token's representation so that its values have mean approximately zero and variance approximately one across the embedding dimension.
Why this is necessary: deep networks suffer from internal covariate shift — the statistical distribution of activations changes layer by layer as training proceeds. This makes each layer's learning target a moving one: the layer keeps adjusting to a slightly different input distribution even as it tries to learn its own function. Layer normalization stabilizes these distributions, making training dramatically more reliable at scale.
Pre-norm vs. post-norm: the original 2017 transformer applied layer norm after the residual connection (post-norm). GPT-2 and subsequent large language models switched to pre-norm (apply layer norm before the sub-layer, inside the residual branch). Pre-norm is more stable for very deep networks and is now the standard for frontier models.
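Layer normalization itself is a short computation. This sketch leaves out the learned scale and shift parameters that real layers add after normalizing, and the epsilon value is a common default assumed here for numerical safety.

```python
import math

def layer_norm(x: list[float], eps: float = 1e-5) -> list[float]:
    """Rescale one token vector to roughly zero mean and unit variance."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(variance + eps) for v in x]

raw = [1.0, 2.0, 3.0, 4.0, 10.0]      # made-up activations for one token
normed = layer_norm(raw)
new_mean = sum(normed) / len(normed)
new_var = sum((v - new_mean) ** 2 for v in normed) / len(normed)
print(f"mean {new_mean:.6f}, variance {new_var:.6f}")
```

The statistics are computed across one token's own dimensions, so the result does not depend on the other tokens in the sequence or on the batch.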
One complete transformer layer
Assembled, a single transformer layer processes its input token sequence through:
Layer norm (pre-norm variant)
Multi-head self-attention
Residual add (input + attention output)
Layer norm
Feed-forward network
Residual add (result of the first residual add + FFN output)
A complete transformer model stacks N of these layers in sequence. The output of layer k becomes the input to layer k+1. Common depths: GPT-2 small = 12 layers, GPT-2 XL = 48 layers, GPT-3 = 96 layers. GPT-4's depth has not been published; unofficial estimates put it at around 120 layers in a mixture-of-experts architecture.
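The pre-norm ordering and the stacking can be sketched as plain function composition. The two sub-layers here (`mean_attention` and `toy_ffn`) are hypothetical stand-ins invented for this example so the wiring can run without trained weights, and the stack reuses the same stand-ins in every layer, whereas a real model gives each layer its own parameters.

```python
import math

def layer_norm(x, eps=1e-5):
    """Rescale one token vector to roughly zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    """Element-wise sum, used for the residual connections."""
    return [ai + bi for ai, bi in zip(a, b)]

def transformer_layer(tokens, attention_fn, ffn_fn):
    """One pre-norm layer: norm, attention, add, then norm, FFN, add."""
    normed = [layer_norm(t) for t in tokens]
    attended = attention_fn(normed)                 # mixes information across tokens
    tokens = [add(t, a) for t, a in zip(tokens, attended)]
    normed = [layer_norm(t) for t in tokens]
    return [add(t, ffn_fn(n)) for t, n in zip(tokens, normed)]

def mean_attention(tokens):
    """Hypothetical stand-in for attention: every token receives the average."""
    dim = len(tokens[0])
    avg = [sum(t[j] for t in tokens) / len(tokens) for j in range(dim)]
    return [list(avg) for _ in tokens]

def toy_ffn(x):
    """Hypothetical stand-in for the FFN: a scaled ReLU applied per token."""
    return [0.1 * max(0.0, v) for v in x]

def run_stack(tokens, num_layers):
    """Feed the output of each layer into the next."""
    for _ in range(num_layers):
        tokens = transformer_layer(tokens, mean_attention, toy_ffn)
    return tokens

inputs = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
outputs = run_stack(inputs, num_layers=3)
print(len(outputs), len(outputs[0]))
```

One property worth noticing: if both sub-layers output zeros, the layer returns its input unchanged. That identity path is what the residual connections provide.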
Encoder, Decoder, and Encoder-Decoder — Three Flavors of Transformer
The transformer architecture has three structural variants that appear throughout the AI landscape. Understanding the distinction resolves much of the confusion around why BERT and GPT behave so differently despite being "both transformers."
Encoder-only: deep understanding, no generation
Encoder-only transformers process the entire input sequence with fully bidirectional self-attention — every token can attend to every other token in both directions (past and future) simultaneously. This produces a rich, context-aware representation of every input token that captures global meaning.
What this is good for: tasks that require deep understanding of a fixed input. Text classification (is this review positive or negative?), named entity recognition (which words are person names?), semantic similarity (are these two sentences paraphrases?), question answering (given a passage, answer this question), and information retrieval.
What this is not good for: generation. Bidirectional attention means the model can see future tokens when representing past ones. If you ask it to predict the next word, it is effectively cheating — it has already "seen" the answer in the right-to-left direction. Encoders are readers, not writers.
Training objective: BERT (Devlin et al., Google, 2018) introduced Masked Language Modeling (MLM). Randomly mask 15% of input tokens, then train the model to predict the masked tokens using context from both directions. This requires the model to build deep bidirectional understanding of every token.
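A toy version of the masking step, as a sketch. It is deliberately simplified: BERT's actual recipe replaces only most of the selected tokens with [MASK] and swaps or keeps the rest, and it works on sub-word IDs rather than whole words. The sentence and seed are arbitrary choices for illustration.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random ~15% of tokens with [MASK]; return inputs and targets."""
    rng = random.Random(seed)
    num_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), num_to_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]      # what the model must predict
        masked[pos] = "[MASK]"          # what the model actually sees
    return masked, targets

sentence = "the cat sat on the mat and looked at the dog by the door".split()
masked, targets = mask_tokens(sentence)
print(masked)
print(targets)
```

The training loss is computed only at the masked positions, using the unmasked tokens on both sides as context.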
Key models: BERT, RoBERTa, ALBERT, DeBERTa, DistilBERT, and embedding models like text-embedding-3 (OpenAI).
Decoder-only: generation as the only job
Decoder-only transformers apply causal (unidirectional) self-attention — each token can only attend to previous tokens, never future ones. This causal mask is enforced by adding negative infinity to attention scores for all future token positions before the softmax, which drives their weights to zero. The model truly cannot see ahead.
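The masking trick is small enough to show in full. In this sketch the raw scores are made-up numbers, and the softmax is written out per row so you can see the future positions drop to exactly zero.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax where row i may only use columns 0..i."""
    weight_rows = []
    for i, row in enumerate(scores):
        # Add negative infinity to every future position (j > i).
        masked = [s + (0.0 if j <= i else -math.inf) for j, s in enumerate(row)]
        peak = max(masked)
        exps = [math.exp(s - peak) for s in masked]   # exp(-inf) is 0.0
        total = sum(exps)
        weight_rows.append([e / total for e in exps])
    return weight_rows

# Made-up raw scores for a three-token sequence.
scores = [[1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0],
          [1.0, 2.0, 3.0]]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its row collapses to a single weight of 1. Each later row spreads its weight over one more position.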
This constraint enables autoregressive generation: the model produces tokens one by one. At each step, it takes the full sequence of tokens generated so far and predicts the next one. Append the predicted token. Repeat until the model generates a stop token or reaches a maximum length.
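The generation loop itself is simple once the model is treated as a black box. In this sketch `toy_model` is a hypothetical stand-in (a fixed lookup table on the last token) where a real system would run the full transformer and pick from its predicted distribution. The `<stop>` token name is also an assumption.

```python
def generate(prompt, next_token_fn, stop_token="<stop>", max_new_tokens=20):
    """Repeatedly predict one token and append it to the sequence."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token_fn(tokens)      # the model sees everything so far
        if token == stop_token:
            break
        tokens.append(token)
    return tokens

# Hypothetical stand-in for a trained model: a fixed lookup on the last token.
TOY_NEXT = {"the": "cat", "cat": "sat", "sat": "down", "down": "<stop>"}

def toy_model(tokens):
    return TOY_NEXT.get(tokens[-1], "<stop>")

print(generate(["the"], toy_model))
print(generate(["the"], toy_model, max_new_tokens=2))
```

Both exit conditions from the text appear here: the first call ends on the stop token, the second on the length limit.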
Training objective: next-token prediction (causal language modeling). Given all tokens up to position n, predict token n+1. The training is run in parallel across all positions simultaneously — a teacher-forcing approach where the correct token at each position is supplied during training, so the model learns from every position in every sequence in a single forward pass.
Why next-token prediction is so powerful: to predict the next word reliably across the full diversity of internet text, a model must implicitly learn grammar, factual knowledge, reasoning patterns, dialogue structure, code syntax, and writing style — not because these were explicitly taught, but because they are all predictively useful. Scale this objective to trillions of tokens and the model acquires a remarkably broad competence.
Key models: GPT series (OpenAI), Claude (Anthropic), Gemini (Google), Llama (Meta), Mistral, Falcon, Qwen, and virtually all frontier conversational AI systems.
Encoder-decoder: the original architecture
The original 2017 transformer was an encoder-decoder. The encoder processes the input sequence bidirectionally, producing a rich representation. The decoder generates the output autoregressively, using cross-attention at each step to attend to the encoder's output — so the decoder can directly query what the encoder understood about the input while generating each output token.
What this is good for: sequence-to-sequence tasks where input and output are structurally different — translation (English to French), summarization (article to summary), speech transcription (audio to text), document-to-structured-data extraction.
Key models: T5 (Google), BART (Meta), mT5, MarianMT (translation), Whisper (speech recognition), and many vision-language models.
Why decoder-only dominates at scale: the two-module structure is harder to scale efficiently. Decoder-only models can handle sequence-to-sequence tasks through prompting ("Translate to French: [text]") without architectural changes, and scaling a single module is simpler than scaling two coordinated ones.
Architecture | Attention direction | Best for |
Encoder-only (BERT, RoBERTa) | Fully bidirectional | Classification, NER, embeddings, semantic search |
Decoder-only (GPT-4, Claude, Llama) | Causal — left to right only | Text generation, conversation, reasoning, code |
Encoder-decoder (T5, Whisper) | Bidirectional enc. + causal dec. | Translation, summarization, speech-to-text |
Training Transformers — From Random Weights to Intelligence
A freshly initialized transformer knows nothing. Every weight matrix is set to small random values. Training is the process of adjusting those billions of weights, iteratively, over trillions of tokens, until the model predicts its training data well enough that broadly useful capabilities emerge as a side effect.
Pre-training: self-supervised learning at scale
Pre-training is the first and most compute-intensive phase. The model is trained on a massive, diverse corpus using a self-supervised objective — one where the labels come from the data itself, requiring no human annotation.
For decoder-only models: next-token prediction across the entire corpus. GPT-3 trained on approximately 300 billion tokens drawn from Common Crawl (filtered web text), WebText (curated web pages), digitized books, and Wikipedia. Llama 3 trained on approximately 15 trillion tokens, a 50× increase in data scale in roughly four years.
For encoder-only models: masked language modeling. BERT's training corpus was approximately 3.3 billion words (English Wikipedia plus BooksCorpus). That is far smaller than GPT-scale corpora, but BERT is also a far smaller model: BERT-base has roughly 110 million parameters.
The key insight of self-supervision: to predict whether the next word is "Paris" or "London" in the sentence "The capital of France is ___," the model must encode the fact that France has a capital city and that it is Paris. This factual knowledge is forced into the model's weights not by explicit instruction, but by the pressure to predict text well. Repeat this across every fact, pattern, and structure in a trillion-token corpus, and the model emerges with remarkable breadth.
Loss, gradients, and backpropagation
At each training step, the model makes predictions, and those predictions are evaluated by a loss function. For next-token prediction, the standard is cross-entropy loss: given the model's probability distribution over the vocabulary for the next token, the loss is the negative log-probability assigned to the correct token. If the model confidently predicts the right answer, loss is near zero. If it is confident about the wrong answer, loss is high.
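Cross-entropy for a single prediction is one line of arithmetic. The probability values below are made up to show the two extremes described above, and the three-word "vocabulary" is a toy stand-in for a real one with tens of thousands of entries.

```python
import math

def cross_entropy(probabilities: dict[str, float], correct_token: str) -> float:
    """Negative log-probability the model assigned to the correct token."""
    return -math.log(probabilities[correct_token])

# Made-up predicted distributions for "The capital of France is ___".
confident_right = {"Paris": 0.98, "London": 0.01, "Rome": 0.01}
confident_wrong = {"Paris": 0.01, "London": 0.98, "Rome": 0.01}

low_loss = cross_entropy(confident_right, "Paris")
high_loss = cross_entropy(confident_wrong, "Paris")
print(f"confident and right: {low_loss:.3f}")
print(f"confident and wrong: {high_loss:.3f}")
```

Because of the logarithm, the penalty grows sharply as the probability on the correct token approaches zero, which is what pushes the model away from confident mistakes.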
Backpropagation computes the gradient of this loss with respect to every parameter in the model — all Q, K, V, and FFN weight matrices across all layers. This gradient tells each parameter how much to change, and in which direction, to reduce the loss. The optimizer (typically Adam) then updates the parameters.
Why residual connections matter for training: in a 96-layer network, backpropagation must carry the gradient signal from the output all the way back to layer 1. Without skip connections, this signal would vanish or explode through 96 sequential multiplications. Residual connections create short-circuit paths that allow gradients to flow from any layer directly to earlier layers — making deep transformers trainable where deep RNNs were not.
Scaling laws: why bigger is (predictably) better
Kaplan et al. (OpenAI, 2020) made a foundational empirical discovery: transformer performance follows smooth power laws with respect to three factors, namely model parameters (N), training tokens (D), and compute budget (C). Double any one factor, provided the other two are not the bottleneck, and loss falls by a predictable, consistent amount. These "scaling laws" turned training large models into an engineering science rather than a craft.
The Chinchilla paper (Hoffmann et al., DeepMind, 2022) refined this picture: for a fixed compute budget, optimal performance requires scaling N and D in roughly equal proportion. By that standard the original GPT-3 (175 billion parameters, 300 billion tokens) was undertrained relative to its size. Chinchilla, a model with 70 billion parameters trained on 1.4 trillion tokens, matched or outperformed much larger models of the time at substantially lower inference cost. This finding reshaped how the field trains large models.
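The gap between the two training recipes is easiest to see as a tokens-per-parameter ratio. The sketch below uses the parameter and token counts quoted above; the compute estimate relies on the common rule of thumb C ≈ 6 × N × D, which is an approximation from the scaling-law literature and not an exact FLOP count.

```python
def tokens_per_parameter(params: float, tokens: float) -> float:
    """How many training tokens the model saw per parameter."""
    return tokens / params

def approx_training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb estimate C ~ 6 * N * D (an approximation, not an exact count)."""
    return 6 * params * tokens

models = {
    "GPT-3 (175B params, 300B tokens)": (175e9, 300e9),
    "70B params, 1.4T tokens": (70e9, 1.4e12),
}
for name, (n, d) in models.items():
    print(f"{name}: {tokens_per_parameter(n, d):.1f} tokens per parameter, "
          f"about {approx_training_flops(n, d):.2e} training FLOPs")
```

Under these numbers the smaller model sees roughly twenty tokens per parameter while GPT-3 sees fewer than two, which is the imbalance the Chinchilla analysis pointed at.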
Emergent capabilities: when scale produces surprise
As models scale beyond certain parameter thresholds, qualitatively new capabilities appear that were absent in smaller models: few-shot in-context learning, multi-step arithmetic, code generation, chain-of-thought reasoning. These emergent abilities were not explicitly trained for; they appear when the model is large enough. They are not fully understood and remain an active area of research in AI safety and interpretability.
Fine-tuning and alignment: after pre-training
A pre-trained model is a text-completion engine. It predicts what text comes next. It does not follow instructions, engage in dialogue helpfully, avoid harmful outputs, or exhibit consistent values. Post-training alignment addresses this.
Supervised Fine-Tuning (SFT): train the model on a curated dataset of (prompt, ideal response) pairs. Human experts write both the prompt and the desired model response. The model learns to respond to instructions by example. This phase is relatively small in compute but critical for usability.
Reinforcement Learning from Human Feedback (RLHF): human raters compare pairs of model responses and rank them by quality. A reward model is trained on these preferences. The language model is then fine-tuned to maximize the reward model's scores using PPO (Proximal Policy Optimization). This aligns model outputs with human preferences more flexibly than SFT alone, but is complex and expensive.
Direct Preference Optimization (DPO): a mathematically simpler alternative to RLHF that directly optimizes the language model on preference pairs without training a separate reward model. Increasingly used for alignment fine-tuning due to its stability and simplicity.
Constitutional AI (Anthropic): train the model to critique and revise its own outputs based on a set of stated principles (the "constitution"), fine-tune on the revised outputs, then run RL using preference labels generated by an AI model guided by the same principles. Reduces reliance on human feedback at scale by making the model itself a partial judge of its own outputs.
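Of the methods above, DPO is the easiest to write down. The sketch below computes the DPO loss for a single preference pair, assuming you already have the summed log-probabilities of the chosen and rejected responses under both the model being trained and a frozen reference model. The function name, argument names, and the beta value are illustrative choices.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log(sigmoid(beta * margin))."""
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(m)) == log(1 + exp(-m)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: no preference expressed yet.
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(f"neutral: {neutral:.4f}, improved: {improved:.4f}")
```

The loss falls as the model shifts probability toward the preferred response relative to the reference, which is the sense in which no separate reward model is needed.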
Inference — How a Transformer Generates Output
Training is complete. The model's weights are fixed. You type a message and hit enter. What actually happens between your keystroke and the first word of the response appearing on screen?
Autoregressive generation, step by step
Your message is tokenized into a sequence of integer IDs. Those IDs pass through the embedding layer, receive positional encodings, and enter the first transformer layer. The layers run one after another, but within each layer all token positions are processed in parallel: attention in each layer, FFN in each layer, residual connections throughout. The final layer outputs a vector for every token position.
The vector at the last position is passed through an output projection (a linear layer mapping to the vocabulary size) followed by a softmax, producing a probability distribution over every possible next token. The model samples a token from this distribution, appends it to the sequence, and repeats the entire forward pass with the extended sequence. This continues until the model produces a special end-of-sequence token or reaches the maximum allowed length.
This is why AI text generation feels streaming: it literally is. Each token is generated as a separate forward pass, and can be displayed as it arrives.
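The loop itself is short. In the sketch below the "model" is a hypothetical lookup table whose logits depend only on the last token, standing in for a full transformer forward pass; the vocabulary and the logit values are invented for illustration.

```python
import math
import random

VOCAB = ["<eos>", "the", "cat", "sat", "down"]
EOS_ID = 0

# Hypothetical stand-in for a trained model: logits over VOCAB given the last token.
NEXT_TOKEN_LOGITS = {
    1: [0.0, 0.0, 3.0, 0.5, 0.0],   # after "the"
    2: [0.0, 0.0, 0.0, 3.0, 0.5],   # after "cat"
    3: [0.5, 0.0, 0.0, 0.0, 3.0],   # after "sat"
    4: [4.0, 0.0, 0.0, 0.0, 0.0],   # after "down"
}

def toy_model(token_ids):
    """A real transformer would read the whole sequence; this toy reads one token."""
    return NEXT_TOKEN_LOGITS[token_ids[-1]]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt_ids, max_new_tokens=10, rng=None):
    rng = rng or random.Random(0)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(toy_model(ids))          # one forward pass per new token
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        if next_id == EOS_ID:                    # stop at end-of-sequence
            break
        ids.append(next_id)                      # extend the sequence and repeat
    return ids

output_ids = generate([1])
print(" ".join(VOCAB[i] for i in output_ids))
```

Each pass through the loop corresponds to one token appearing on screen.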
Sampling strategies: trading accuracy for creativity
The way a token is selected from the probability distribution has a large effect on output quality and character:
Greedy decoding: always select the highest-probability token. Deterministic and fast, but tends toward repetitive, generic, or circular outputs.
Temperature: divide all logits (pre-softmax scores) by a temperature value T before applying softmax. T < 1 sharpens the distribution, making the model more confident and conservative. T > 1 flattens it: more randomness, more creativity, more risk of incoherence. As T approaches 0 the result approaches greedy decoding; T = 0 itself would divide by zero, so implementations treat it as a special case meaning greedy. Most deployed systems use T around 0.7 to 1.0.
Top-k sampling: restrict sampling to the k highest-probability tokens at each step, redistributing their probabilities proportionally before sampling. Prevents very low-probability "wild card" tokens from occasionally being selected.
Top-p (nucleus) sampling: sample from the smallest set of tokens whose cumulative probability exceeds threshold p (e.g., 0.9). When the model is confident (peaked distribution), this is a small set of tokens. When uncertain (flat distribution), it is a larger set. Adapts dynamically to model confidence.
Beam search: maintain B candidate sequences simultaneously, expanding each at every step and keeping the B highest-probability continuations overall. Produces high-probability, grammatically correct outputs but can degenerate into repetitive loops — less commonly used for open-ended generation.
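Greedy decoding, temperature, top-k, and top-p can be combined in one small function. This is a minimal sketch using plain Python lists; the function names are illustrative, and production libraries apply the same ideas to tensors, often in a different order or with extra options.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def keep_only(probs, keep):
    """Zero out everything outside `keep` and renormalize the rest."""
    kept = set(keep)
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return keep_only(probs, order[:k])

def top_p_filter(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:      # smallest set whose mass reaches p
            break
    return keep_only(probs, keep)

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    if temperature == 0:         # special case: greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax([x / temperature for x in logits])
    if top_k is not None:
        probs = top_k_filter(probs, top_k)
    if top_p is not None:
        probs = top_p_filter(probs, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1, -1.0]   # hypothetical scores for a 4-token vocabulary
print(sample(logits, temperature=0))
print(sample(logits, temperature=0.8, top_k=3, top_p=0.9))
```

Beam search is left out because it tracks several sequences at once rather than choosing a single token per step.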
The KV cache: making inference efficient
Naive inference has a severe efficiency problem. At step 500 of generating a long response, computing attention for the new token requires the Key and Value vectors of all 499 previous tokens in every layer. Without optimization, you would recompute all of those K and V vectors from scratch at every single step, re-running every layer's projections for tokens that have not changed.
The KV cache stores the Key and Value vectors for all previously processed tokens. At each new generation step, only the new token's K and V vectors are computed and appended to the cache. The attention computation for the new token uses the cached K/V entries plus the new ones. This cuts per-step attention compute from O(N²) to O(N), where N is the current sequence length: a dramatic speedup for long generations.
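The following sketch shows the idea for a single attention head in a single layer. The 2×2 projection matrices and the token embeddings are made-up numbers, and a real model would keep one such cache per layer and per head.

```python
import math

def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all stored keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

# Hypothetical projection weights for one head.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [-0.2, 0.3]]
W_V = [[0.3, -0.1], [0.2, 0.4]]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        """Project only the new token, append to the cache, then attend."""
        self.keys.append(matvec(W_K, x))
        self.values.append(matvec(W_V, x))
        return attend(matvec(W_Q, x), self.keys, self.values)

def step_without_cache(xs):
    """Naive version: re-project every token at every step."""
    keys = [matvec(W_K, x) for x in xs]
    values = [matvec(W_V, x) for x in xs]
    return attend(matvec(W_Q, xs[-1]), keys, values)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
cache = KVCache()
for t in range(len(tokens)):
    cached = cache.step(tokens[t])
    naive = step_without_cache(tokens[: t + 1])
    assert all(abs(a - b) < 1e-12 for a, b in zip(cached, naive))
print("cached and naive attention outputs match at every step")
```

Both paths produce the same output; the cached path simply skips the projections it has already done.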
The cost of the KV cache is memory. For a large model generating a long sequence, the cache can require gigabytes of GPU VRAM per active sequence. Managing KV cache memory is a significant engineering challenge for systems serving many users simultaneously, and has led to techniques like paged attention (vLLM) that manage KV cache memory with operating-system-style virtual memory pages.
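A rough size estimate makes the memory cost tangible. The shape below (96 layers, 96 heads of dimension 128, 16-bit values) is an assumed GPT-3-like configuration chosen for illustration; models that share K/V heads across query heads need proportionally less.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache Keys and Values for one sequence."""
    keys_and_values = 2   # one K tensor and one V tensor per layer
    return keys_and_values * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

size = kv_cache_bytes(n_layers=96, n_kv_heads=96, head_dim=128, seq_len=2048)
print(f"{size / 1e9:.2f} GB for a single 2,048-token sequence")
```

Under these assumptions a single 2,048-token sequence already needs close to 10 GB, and the figure grows linearly with both sequence length and the number of concurrent users.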
Context windows: the length limit and what drives it
The context window is the maximum sequence length the model can process in a single forward pass — the total number of tokens it can "see" at once, including both the prompt and the generated response.
Representative context windows over time:
Model | Context window |
GPT-2 (2019) | 1,024 tokens (~750 words) |
GPT-3 (2020) | 2,048 tokens (~1,500 words) |
GPT-4 Turbo (2023) | 128,000 tokens (~96,000 words) |
Claude 3 (2024) | 200,000 tokens (~150,000 words) |
Gemini 1.5 Pro (2024) | 1,000,000 tokens (~750,000 words) |
Two factors limit context window size. First, the O(N²) memory cost of attention — doubling the context length quadruples the memory required for the attention matrix. Second, positional encoding generalization — models perform worse on sequence lengths significantly beyond what they encountered during training, even with theoretically extensible positional encodings. Both are active areas of engineering and research.
Why Transformers Work Beyond Language — the General Architecture
The most surprising fact about the transformer is not that it works for language — it was designed for language. The surprise is that the same architecture, with different tokenizers and training objectives, achieves state of the art across almost every domain of AI.
The modality-agnostic core
The transformer has no built-in notion of what tokens represent. The architecture simply defines: given a sequence of vectors, compute attention between all pairs, apply an FFN to each, use residual connections and layer norm to stabilize. What those vectors represent — words, image patches, audio frames, amino acids, weather sensor readings — is entirely determined by the tokenizer and encoder that sits before the transformer.
This is the deep reason for the transformer's universality: attention is a general computation. It finds which elements of a set are relevant to each other and aggregates accordingly. Any domain where relational structure matters — where understanding depends on knowing which parts of the input relate to which other parts — is potentially well-served by attention.
Computer vision: Vision Transformers (ViT)
Dosovitskiy et al. (Google Brain, 2020) demonstrated that a standard transformer, applied to 16×16-pixel patches of an image instead of word tokens, achieves competitive performance on ImageNet classification — with no convolutional layers at all, when trained on sufficient data.
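The patching step is simple enough to sketch directly. The code below assumes an image stored as nested lists of shape height × width × channels and the 224×224 input size commonly used with ViT; a real model would follow this with a learned linear projection and positional embeddings.

```python
def patchify(image, patch_size=16):
    """Split an H x W x C image into flattened, non-overlapping patch vectors."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                channel
                for row in image[top:top + patch_size]
                for pixel in row[left:left + patch_size]
                for channel in pixel
            ]
            patches.append(patch)
    return patches

# A blank 224 x 224 RGB image, used only to show the resulting shapes.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image)
print(len(tokens), len(tokens[0]))   # 196 patches, 768 values each
```

Each of the 196 vectors then plays the role that a word embedding plays in a language model.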
The key advantage over convolutional neural networks (CNNs): global attention from layer one. In a CNN, long-range relationships in an image must be built up hierarchically through many layers of local convolutions. A ViT can relate any two patches to each other at the very first layer — a patch in the top-left corner can directly attend to a patch in the bottom-right. This long-range capability is particularly valuable for tasks requiring global image understanding.
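ViT or its variants now underpin many frontier vision systems: CLIP's image encoder (OpenAI), DINOv2 (Meta), and the Segment Anything Model (Meta). Multimodal models such as GPT-4V and Claude 3 are widely assumed to use similar image encoders, though their architectures are not public. Diffusion image generators are a partial exception: Stable Diffusion XL denoises with a U-Net that contains attention blocks, while newer systems such as Stable Diffusion 3 use a transformer denoiser.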
Biology: AlphaFold 2 and the protein folding revolution
Predicting the three-dimensional structure of a protein from its amino acid sequence — the protein folding problem — had been one of biology's grand open challenges for fifty years. DeepMind's AlphaFold 2 (2020) largely solved it using a transformer-based architecture.
The core module, called the Evoformer, applies transformer-style attention over two representations simultaneously: a multiple sequence alignment of the target protein and evolutionarily related sequences (the amino acids play the role of word tokens), and a pairwise representation of the relationships between amino acid residues (essentially, features for all pairs of sequence positions). This joint representation allows the model to reason about which residues are likely to be near each other in 3D space, which is the essence of structure prediction.
The result: AlphaFold 2 was used to predict structures for over 200 million proteins, covering nearly every catalogued protein sequence, and for many of them the accuracy is comparable to experimental methods such as X-ray crystallography. The structures were deposited in a public database that has since become one of the most widely used resources in structural biology. Drug discovery, vaccine development, and basic biological research all began drawing on these predictions within two years.
Other domains where transformers dominate
Music generation: Jukebox (OpenAI), MusicLM (Google), AudioCraft (Meta) — all transformer-based, treating audio tokens as sequences. AudioCraft generates 30-second music clips from text descriptions.
Code: Codex (the model underlying GitHub Copilot), Code Llama, DeepSeek Coder — decoder-only transformers trained on code corpora. Particularly effective because code has strong local and global structure that attention captures well.
Weather forecasting: Pangu-Weather (Huawei, 2023) trained a transformer on 39 years of ERA5 atmospheric reanalysis data. For deterministic short-range (1-7 day) forecasts, it reported better accuracy than ECMWF's operational numerical weather prediction system on the variables tested. That is a striking result from a domain long considered too physically complex for learned models.
Drug discovery: molecular transformers trained on SMILES strings (text representations of molecular structure) predict molecular properties and generate novel drug candidates. Insilico Medicine and others have used these to progress compounds into clinical trials.
Reinforcement learning: Decision Transformer (Chen et al., 2021) recasts RL as a sequence modeling problem, treating (return-to-go, state, action) triples as tokens, where the return-to-go is the sum of rewards still to be collected. On offline RL benchmarks, it matches or exceeds traditional RL algorithms with simpler training.
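The Decision Transformer item is a good example of how little changes between domains: the work is mostly in turning a trajectory into a token sequence. The sketch below follows the paper's idea of conditioning on the return-to-go (the sum of future rewards) rather than the raw reward; the tuple-based token format and the toy trajectory are illustrative assumptions.

```python
def returns_to_go(rewards):
    """For each timestep, the total reward from that step to the end."""
    running, out = 0.0, []
    for reward in reversed(rewards):
        running += reward
        out.append(running)
    return out[::-1]

def to_token_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) for every timestep."""
    sequence = []
    for rtg, state, action in zip(returns_to_go(rewards), states, actions):
        sequence.extend([("return_to_go", rtg), ("state", state), ("action", action)])
    return sequence

# Hypothetical three-step trajectory.
states = ["s0", "s1", "s2"]
actions = ["left", "right", "right"]
rewards = [1.0, 0.0, 2.0]

for token in to_token_sequence(states, actions, rewards):
    print(token)
```

At inference time the desired return is supplied as the first token, and the model predicts the actions that would achieve it.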
Current Limitations and the Research Frontiers
The transformer is extraordinary. It is also genuinely limited in specific, well-characterized ways. Honest understanding of both dimensions is what distinguishes AI literacy from AI hype.
Quadratic attention scaling
The O(N²) memory and compute cost of full attention is the primary infrastructure bottleneck. Processing a 100,000-token document requires ~10 billion attention weight computations per head per layer (100,000²). For a 96-layer model, that is nearly a trillion per head, before the FFN layers.
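The arithmetic is worth checking by hand. This small sketch counts attention weights only, ignoring the per-weight cost of the dot products and the number of heads unless you pass them in; the function name is illustrative.

```python
def attention_weight_count(seq_len, n_layers=1, n_heads=1):
    """Number of (query, key) pairs scored by full attention."""
    return seq_len * seq_len * n_layers * n_heads

per_layer = attention_weight_count(100_000)
whole_model = attention_weight_count(100_000, n_layers=96)
print(f"per layer:  {per_layer:.2e}")
print(f"96 layers:  {whole_model:.2e}")
print("doubling the length multiplies the count by",
      attention_weight_count(200_000) // per_layer)
```

The last line shows the quadratic growth directly: twice the tokens means four times the attention weights.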
Active engineering solutions: Flash Attention (Dao et al., 2022) — a memory-efficient implementation of exact attention that reorders operations to exploit GPU memory hierarchy, achieving 2-4× speedup with identical results. Sparse attention variants (Longformer, BigBird) attend to a local window plus a set of global tokens, reducing O(N²) to O(N log N) or O(N). Ring Attention distributes the attention computation across multiple devices for contexts too long to fit on a single GPU.
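To see why a local window plus a few global tokens scales linearly, it helps to count the allowed query-key pairs. The sketch below is a simplified pattern in the spirit of Longformer-style attention, not a reproduction of any specific model; the window size and number of global tokens are arbitrary.

```python
def sparse_attention_pairs(seq_len, window, n_global):
    """Count (query, key) pairs allowed by a sliding window plus global tokens."""
    allowed = 0
    for q in range(seq_len):
        for k in range(seq_len):
            is_local = abs(q - k) <= window
            is_global = q < n_global or k < n_global   # first tokens see and are seen by all
            if is_local or is_global:
                allowed += 1
    return allowed

for n in (256, 512):
    sparse = sparse_attention_pairs(n, window=8, n_global=2)
    print(f"N = {n}: sparse pairs = {sparse}, full pairs = {n * n}")
```

Doubling the sequence length roughly doubles the sparse count while quadrupling the full count, which is the O(N) versus O(N²) difference in miniature.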
Alternative architectures: state-space models (SSMs), particularly Mamba (Gu & Dao, 2023), process sequences in linear time O(N) using structured state transitions rather than attention. On many language benchmarks, well-trained Mamba models are competitive with transformers of similar size. Whether SSMs can match transformers at the frontier scale remains an open research question.
Systematic generalization: interpolation vs. extrapolation
Transformers are exceptionally good at interpolation — performing well on inputs that fall within the distribution of the training data. They are significantly weaker at systematic generalization — applying a rule or operation correctly to input combinations that were not present in training.
Classic examples: GPT-4 can sometimes multiply 6-digit numbers correctly, but becomes far less reliable on 12-digit numbers even though the multiplication algorithm is the same. Models often fail spatial reasoning tasks that require exact rule application rather than approximate pattern matching. Logic puzzles with novel structures, not seen in training, expose brittle behavior.
This is not merely a scale problem. Larger models make fewer errors of this type, but do not eliminate them systematically. The architecture may be fundamentally better at approximation than at exact algorithmic reasoning — a characteristic that shapes which tasks should or should not be delegated to transformer-based AI.
No persistent memory
A transformer's entire "memory" of a conversation is the context window. Nothing from previous conversations persists unless explicitly stored and retrieved. This is a fundamental architectural constraint, not an engineering gap.
Workarounds: Retrieval-Augmented Generation (RAG): maintain an external vector database, retrieve relevant stored documents at inference time, and insert them into the context window. MemGPT (Packer et al., 2023): treat the context window like an operating system's main memory and manage a hierarchy of external storage tiers (in-context, external database, archive), with the model itself deciding when to page information in and out. These are engineering approximations of persistent memory, not equivalent to it.
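The retrieval half of RAG fits in a few lines. In this sketch the three-dimensional "embeddings" are hand-written stand-ins for vectors that a real embedding model would produce, and the store is a plain list rather than a vector database; all names are illustrative.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical document store: (text, embedding) pairs.
STORE = [
    ("The KV cache stores keys and values for past tokens.", [0.9, 0.1, 0.0]),
    ("Vision transformers split images into patches.", [0.0, 0.9, 0.2]),
    ("Chinchilla balances parameters and training tokens.", [0.1, 0.0, 0.9]),
]

def retrieve(query_embedding, store, top_k=2):
    """Return the top_k documents most similar to the query embedding."""
    ranked = sorted(store, key=lambda item: cosine(query_embedding, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]

def build_prompt(question, query_embedding, store, top_k=2):
    """Insert the retrieved documents into the context ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query_embedding, store, top_k))
    return f"Context:\n{context}\n\nQuestion: {question}"

print(build_prompt("How does inference avoid recomputation?", [1.0, 0.0, 0.1], STORE, top_k=1))
```

Nothing about the model changes; the "memory" lives entirely in what gets placed into the context window.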
The black box: interpretability challenges
Despite attention weights being theoretically inspectable, large transformer models are functionally opaque. We might observe that a particular head (say, head 7 in layer 4) attends to pronouns' antecedents, but we cannot reliably trace why any specific output was produced, nor guarantee that a model will behave consistently in novel situations.
Mechanistic interpretability (research programs at Anthropic, EleutherAI, and MIT) is making progress: identifying circuits — subgraphs of attention heads and FFN neurons that implement specific behaviors such as indirect object identification or modular arithmetic. But the field is early, current techniques scale poorly to frontier model sizes, and there is no established method for converting mechanistic understanding into behavioral guarantees.
Research frontiers worth watching
Mixture of Experts (MoE): route each token to a learned subset of expert FFN layers rather than all of them. GPT-4 is widely believed to use MoE, though this is unconfirmed. Enables total parameter counts several times larger than a dense model at similar per-token compute, because most parameters are inactive for any given token.
Speculative decoding: use a small, fast draft model to generate candidate tokens, then verify several candidates at once with the large model in a single forward pass. Reported speedups are typically around 2-3×, and with the standard acceptance rule the output distribution matches that of the large model alone, so quality is preserved at the price of running the extra draft model.
Test-time compute scaling: models like OpenAI o1 and o3 spend additional computation at inference time to generate extended reasoning traces before producing a final answer. Quality scales with inference compute as well as training compute — a new axis of capability improvement.
Hybrid SSM-transformer architectures: Jamba (AI21 Labs) and Zamba interleave transformer attention layers with Mamba state-space layers. The goal: transformer-quality long-range reasoning with linear-time sequence processing. Early results are promising.
Long-context and memory: beyond extending context windows, research into retrieval-integrated transformers (RAG), differentiable memory banks, and recurrent memory transformers aims to give models effective access to long-term information without the O(N²) cost of full attention over everything.
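Of the frontiers listed above, Mixture of Experts routing is the simplest to illustrate. The sketch below assumes top-2 routing with gate weights renormalized over the chosen experts, which is one common design among several. The router logits are passed in directly, whereas a real layer would compute them from the token with a learned linear map, and the "experts" are toy scaling functions.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, top_k=2):
    """Choose the top_k experts and renormalize their gate weights to sum to 1."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    total = sum(probs[i] for i in chosen)
    return {i: probs[i] / total for i in chosen}

def moe_layer(x, router_logits, experts, top_k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = route(router_logits, top_k)
    output = [0.0] * len(x)
    for index, gate in gates.items():
        expert_out = experts[index](x)
        output = [o + gate * v for o, v in zip(output, expert_out)]
    return output, sorted(gates)

# Hypothetical experts: each just scales the input by a different factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

output, used = moe_layer([1.0, -1.0], router_logits=[0.1, 2.0, -1.0, 1.5], experts=experts)
print("experts used:", used)
print("output:", output)
```

Only two of the four experts run for this token, which is why total parameter count can grow without a matching growth in per-token compute.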
Conclusion — What Makes the Transformer Special
Walk back through what we have covered. RNNs processed sequences step by step, carrying a degrading memory forward — constrained by sequential dependencies and vanishing gradients. LSTMs added gating mechanisms that helped, but did not solve the parallelization problem. Then eight researchers asked: what if you threw out recurrence entirely and let every token directly attend to every other token at once?
The answer was the transformer. Its genius is not the attention formula — variants of attention had appeared in papers before 2017. Its genius is the realization that attention, combined with FFN layers, residual connections, and layer normalization, is sufficient. You need nothing else. Stack these layers, scale the parameters, train on vast data, and intelligence emerges.
The transformer's central insight
Attention is a general-purpose relational computation. Any input that can be broken into a sequence of vectors, and where understanding requires knowing which elements relate to which other elements, is potentially well-served by a transformer. That description applies to language, images, audio, protein sequences, molecular graphs, weather fields, and nearly every other structured data type that matters.
The surprising coda to the story: the researchers who wrote "Attention Is All You Need" were trying to improve machine translation. They were not trying to build the foundation of general-purpose AI. The transformer's universality was discovered after the fact — by researchers across dozens of labs who tried the same architecture on different problems and found that it worked there too. Science rarely produces world-altering results as clean as this one.
The transformer's open problems — quadratic attention scaling, systematic generalization gaps, the absence of persistent memory, fundamental interpretability challenges — are not reasons to dismiss it. They are the research agenda for the next decade of AI. Every one of them has active, well-funded programs working on solutions.
And the closing truth, with appropriate epistemic humility: in 2017, attention was all you needed. In 2025, it still largely is — extended by scale, refined by alignment, and surrounded by engineering that would have seemed fantastical to the authors of the original paper. What it will look like in 2030 is genuinely uncertain. But it will almost certainly be traceable, in a direct line, back to those twelve pages.
Glossary of Key Terms
Term | Simple language definition |
Attention mechanism | The core transformer operation: compare every element of a sequence to every other element, compute relevance scores, and aggregate information weighted by those scores. |
Autoregressive generation | Generating output one token at a time, where each new token is predicted from all previous tokens, then appended before predicting the next. |
Backpropagation | The algorithm for computing how much each model parameter contributed to the prediction error, so that parameters can be adjusted to reduce that error. |
Byte-Pair Encoding (BPE) | A sub-word tokenization algorithm that iteratively merges the most frequent adjacent character pairs, producing a vocabulary of common sub-word units. |
Causal mask | A mask that adds negative infinity to the attention scores of future token positions (and zero elsewhere) before softmax, ensuring decoder models cannot attend to tokens that have not yet been generated. |
Cross-entropy loss | The training loss for next-token prediction: the negative log-probability the model assigns to the correct next token. Lower = better predictions. |
Cross-attention | Attention where queries come from one sequence (the decoder) and keys/values come from a different sequence (the encoder output). |
Embedding | A dense vector representation of a token learned during training. Captures semantic relationships: similar concepts have similar embedding vectors. |
Feed-forward network (FFN) | The position-wise network in each transformer layer. Processes each token independently after attention has gathered cross-token context. |
KV cache | A stored record of Key and Value matrices for previously processed tokens, enabling efficient autoregressive generation without recomputing past tokens. |
Layer normalization | Rescaling each token's representation to have mean ~0 and variance ~1 across the embedding dimension, stabilizing training of deep networks. |
Masked language modeling (MLM) | The BERT training objective: randomly mask tokens in the input and train the model to predict them from surrounding context in both directions. |
Positional encoding | Information added to token embeddings to tell the model where each token sits in the sequence, since attention itself has no built-in notion of position. |
Residual connection | Adding a sub-layer's input to its output (Output = x + Sublayer(x)), creating gradient highways through deep networks and enabling easier training. |
Scaling laws | Empirical power-law relationships between model size, training data, compute, and loss — allowing prediction of how performance improves with scale. |
Softmax | A function that converts a vector of raw scores into a probability distribution (all positive, summing to 1), used to produce attention weights. |
Temperature | A parameter that controls the sharpness of sampling distributions. Low temperature = confident and conservative. High temperature = more random and creative. |
Token | The atomic unit of input to a transformer — a word fragment for text, a pixel patch for images, a spectrogram slice for audio. |
Transformer | The neural network architecture introduced in 2017, using stacked attention and FFN layers, that underlies virtually all modern large AI models. |
Vision Transformer (ViT) | A transformer applied to images by treating fixed-size pixel patches (16×16 in the original paper) as tokens, enabling the same architecture to process both text and visual inputs. |