Understanding the Transformer

Understanding the building blocks and design choices of the Transformer architecture.

This article explains the landmark paper Attention Is All You Need by Vaswani et al. (2017), which introduced the Transformer architecture that powers GPT, BERT, and nearly every modern language model.

Introduction

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism.

The Transformer is a new architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.

The Sequential Bottleneck

Before the Transformer, the dominant approach to sequence tasks—machine translation, language modeling, text generation—was the recurrent neural network (RNN), particularly LSTMs and GRUs.

RNNs process sequences one token at a time. To compute the hidden state at position $t$, you need the hidden state at position $t-1$:

$$h_t = f(h_{t-1}, x_t)$$

This creates two fundamental problems:

Lack of Parallelization

Because each step depends on the previous step, you cannot parallelize computation within a single sequence. Training is inherently sequential in time. For long sequences, this becomes a severe bottleneck.

Figure: RNN sequential processing. Tokens x₁…x₅ produce hidden states h₁…h₅, and each hidden state must wait for the previous one. RNNs process tokens sequentially; this dependency prevents parallel computation.
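The recurrence $h_t = f(h_{t-1}, x_t)$ can be sketched as an explicit loop over time steps. This is a minimal illustration with a simple tanh cell (the weight shapes and names here are illustrative, not from the paper):

```python
import numpy as np

def rnn_forward(x, W_h, W_x, b):
    """Run a simple tanh RNN over a sequence, one step at a time.

    Each h_t depends on h_{t-1}, so the loop over tokens cannot be
    parallelized across time steps.
    """
    d = W_h.shape[0]
    h = np.zeros(d)
    states = []
    for x_t in x:                          # inherently sequential
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 3))              # 5 tokens, input dim 3
H = rnn_forward(seq,
                rng.normal(size=(4, 4)) * 0.1,   # hidden-to-hidden weights
                rng.normal(size=(4, 3)) * 0.1,   # input-to-hidden weights
                np.zeros(4))
print(H.shape)                             # (5, 4): one hidden state per token
```

The `for` loop is the bottleneck: no matter how many processors are available, step $t$ cannot begin until step $t-1$ finishes.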

Long-Range Dependencies

Information from early tokens must survive many sequential steps to influence later processing. Gradients must flow backward through all those steps. In practice, this makes learning long-range dependencies difficult, even with gating mechanisms like those in LSTMs and GRUs.

Key question: Can we design an architecture where every position can directly attend to every other position—without sequential dependencies?

The answer is the Transformer.

Attention as a Lookup

The core idea of attention is surprisingly simple: it’s a soft lookup into a set of values, where the lookup key determines how much weight to give each value.

Think of it like a soft database query: the attention mechanism compares your query to each key, computes a relevance score, and returns a weighted combination of the values rather than a single exact match.

Figure: a query ("What information is relevant here?") is compared against keys k₁…k₅ with associated values v₁…v₅, yielding scores 0.1, 0.7, 0.05, 0.1, 0.05 and the output 0.1·v₁ + 0.7·v₂ + 0.05·v₃ + 0.1·v₄ + 0.05·v₅. Attention computes a weighted sum of values, where the weights come from comparing a query to keys. Here, key k₂ matches the query best, so v₂ dominates the output.
Key insight: Attention connects any two positions in constant time. There's no sequential path that information must traverse.
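The soft lookup can be reproduced in a few lines of NumPy. This is a toy sketch with made-up keys and values, constructed so that the second key clearly matches the query:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Five key/value pairs and one query (toy numbers).
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],     # k2
                 [1.0, -1.0],
                 [-1.0, 0.0],
                 [0.0, -1.0]])
values = np.arange(10.0).reshape(5, 2)   # v1..v5, each 2-dimensional
query = np.array([0.0, 3.0])             # points in the direction of k2

weights = softmax(keys @ query)          # relevance of each key to the query
output = weights @ values                # weighted sum of values

print(weights.round(2))                  # k2 gets most of the mass
print(output)                            # output is dominated by v2
```

No sequential scan is involved: the query is compared to every key at once, and the result is a single weighted sum.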

Scaled Dot-Product Attention

The Transformer uses a specific form of attention called Scaled Dot-Product Attention.

Given a matrix of queries $Q \in \mathbb{R}^{n \times d_k}$, a matrix of keys $K \in \mathbb{R}^{m \times d_k}$, and a matrix of values $V \in \mathbb{R}^{m \times d_v}$, the attention output is:

Scaled Dot-Product Attention
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
Query $Q$: what information does this position need?
Key $K$: what information does this position offer?
Value $V$: the actual content to retrieve
Scaling $\sqrt{d_k}$: prevents dot products from growing too large

Step by step

  1. Compute compatibility scores: $QK^T$ gives an $n \times m$ matrix of dot products. Entry $(i, j)$ measures how much query $i$ matches key $j$.

  2. Scale: Divide by $\sqrt{d_k}$. Without scaling, large $d_k$ values push dot products into regions where softmax has very small gradients.

  3. Normalize: Apply softmax row-wise. Each query now has a probability distribution over keys.

  4. Retrieve: Multiply by $V$. Each output is a weighted combination of values.
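The four steps above translate directly into NumPy. This is a minimal sketch of the formula (no masking, no batching, no learned projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # 1. compatibility, 2. scale
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # 3. row-wise softmax
    return weights @ V                              # 4. weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 queries, d_k = 8
K = rng.normal(size=(6, 8))    # 6 keys
V = rng.normal(size=(6, 8))    # 6 values
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)               # (4, 8): one output row per query
```

Note that the entire computation is two matrix multiplications and a softmax, which is why it parallelizes so well on modern hardware.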

Why scale?

For large $d_k$, the dot products $q \cdot k$ tend to have large magnitude (variance roughly $d_k$ when the components have unit variance). This pushes the softmax into saturated regions where gradients vanish. Scaling by $\sqrt{d_k}$ brings the variance back to 1.
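The variance claim is easy to check empirically. Sampling query and key components with unit variance, the raw dot products have variance close to $d_k$, and dividing by $\sqrt{d_k}$ restores variance close to 1:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512

# Unit-variance components: the dot product q.k sums d_k independent
# products, each with variance 1, so its variance is roughly d_k.
q = rng.normal(size=(10000, d_k))
k = rng.normal(size=(10000, d_k))

dots = (q * k).sum(axis=1)
print(dots.var())                    # close to d_k = 512
print((dots / np.sqrt(d_k)).var())   # close to 1 after scaling
```

Without the scaling, typical logits of magnitude $\sqrt{512} \approx 22$ would drive the softmax into near one-hot outputs with vanishing gradients.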

Scaled Dot-Product Attention
Figure 2 (left) from the paper: The Scaled Dot-Product Attention computation. Queries and Keys are dot-product scored, optionally masked, softmax-normalized, and used to weight the Values.

Multi-Head Attention

A single attention function can only focus on one type of relationship at a time. Multi-Head Attention runs multiple attention functions in parallel, each with its own learned projections.

Multi-Head Attention
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$$

$$\text{where } \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
$W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_k}$: learned projections
$W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$: output projection

Each head can learn to attend to different things: one head might track syntactic structure, another coreference, another nearby positions.

The paper uses $h = 8$ heads with $d_k = d_v = 64$ (for $d_{\text{model}} = 512$).
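With $h = 8$ and $d_k = d_v = 64$, the shapes work out as follows. This sketch wires up the projections and concatenation from the formula above (small random weights stand in for learned parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

d_model, h = 512, 8
d_k = d_model // h                       # 64, as in the paper

rng = np.random.default_rng(0)
X = rng.normal(size=(10, d_model))       # 10 tokens; self-attention: Q = K = V = X

# Per-head projections W_i^Q, W_i^K, W_i^V and the output projection W^O.
W_Q = rng.normal(size=(h, d_model, d_k)) * 0.02
W_K = rng.normal(size=(h, d_model, d_k)) * 0.02
W_V = rng.normal(size=(h, d_model, d_k)) * 0.02
W_O = rng.normal(size=(h * d_k, d_model)) * 0.02

# Each head attends in its own projected subspace.
heads = [attention(X @ W_Q[i], X @ W_K[i], X @ W_V[i]) for i in range(h)]
out = np.concatenate(heads, axis=-1) @ W_O
print(out.shape)                         # (10, 512): back to d_model
```

Because each head works in a $d_k = d_{\text{model}}/h$ subspace, the total cost is similar to one full-dimensional attention, yet the model gets $h$ independent views of the sequence.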

Multi-Head Attention
Figure 2 (right) from the paper: Multi-Head Attention runs multiple single-head attentions in parallel, each with its own learned projections, then concatenates and linearly projects the outputs.

The Transformer Architecture

The Transformer follows the encoder-decoder structure, but built entirely from attention and feed-forward layers.

Figure: the Transformer architecture (simplified). The encoder (left, ×6 layers) applies multi-head self-attention followed by a feed-forward network, each wrapped in Add & Norm, to the input embeddings plus positional encodings. The decoder (right, ×6 layers) applies masked multi-head self-attention, multi-head cross-attention over the encoder's keys and values, and a feed-forward network, each wrapped in Add & Norm, to the output embeddings (shifted right) plus positional encodings. The encoder processes the input sequence; the decoder generates the output, attending to both itself and the encoder output.
Transformer Architecture — original paper figure
Figure 1 from the paper: The full Transformer architecture as presented in Vaswani et al. The left side is the encoder stack; the right side is the decoder, which includes both masked self-attention and cross-attention to the encoder output.

Encoder

Each encoder layer has two sub-layers:

  1. Multi-head self-attention: Every position attends to every position
  2. Feed-forward network: Applied independently to each position
$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

Residual connections and layer normalization wrap each sub-layer.
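The sub-layer wiring — attention, then FFN, each wrapped in a residual connection and layer normalization — can be sketched as follows. The attention sub-layer is passed in as a callable (an identity stand-in below, just to show the wiring), and the small random FFN weights are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: max(0, x W1 + b1) W2 + b2."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, self_attn, ffn_params):
    # Sub-layer 1: self-attention, wrapped in residual + layer norm.
    x = layer_norm(x + self_attn(x))
    # Sub-layer 2: feed-forward, wrapped in residual + layer norm.
    x = layer_norm(x + ffn(x, *ffn_params))
    return x

d_model, d_ff, n = 512, 2048, 10         # dimensions from the paper
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_model))
params = (rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff),
          rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model))

out = encoder_layer(X, self_attn=lambda x: x, ffn_params=params)
print(out.shape)                         # (10, 512): shape is preserved
```

The residual path gives gradients a direct route through the stack, and layer normalization keeps activations in a stable range across the six layers.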

Decoder

Each decoder layer has three sub-layers:

  1. Masked self-attention: Each position attends only to earlier positions
  2. Cross-attention: Queries from decoder; keys/values from encoder
  3. Feed-forward network: Same as encoder
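The masking in sub-layer 1 is implemented by adding $-\infty$ to the scores for all future positions before the softmax, which zeroes out their weights. A minimal sketch:

```python
import numpy as np

def causal_mask(n):
    """-inf above the diagonal: each position sees only itself and earlier ones."""
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1]) + causal_mask(Q.shape[0])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)                   # exp(-inf) = 0: future positions ignored
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))      # 4 tokens, self-attention
out, w = masked_attention(Q, K, V)
print(np.round(w, 2))                    # upper triangle is all zeros
```

This is what makes the decoder autoregressive: during training all positions are computed in parallel, yet position $t$ can never peek at tokens after $t$.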

Positional Encoding

Self-attention is permutation-equivariant—it has no notion of position. The Transformer adds positional encodings to the input embeddings.

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$ $$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
Figure: positional encoding heatmap. Each row is a position, each column a dimension; sin fills the even dimensions, cos fills the odd ones. Lower dimensions vary quickly across positions; higher dimensions vary slowly.

For any fixed offset $k$, $PE_{pos+k}$ can be written as a linear function of $PE_{pos}$. This allows the model to learn to attend by relative position.
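Generating the sinusoidal encodings is a direct transcription of the two formulas above:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd ones."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model / 2)
    angles = pos / 10000 ** (2 * i / d_model)    # one frequency per dim pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

PE = positional_encoding(50, 512)
print(PE.shape)          # (50, 512)
print(PE[0, :4])         # position 0: alternating sin(0) = 0, cos(0) = 1
```

The sin/cos pairing is what enables the linear-function property: shifting a (sin, cos) pair by a fixed offset $k$ is a rotation, which is a linear map.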

Why Self-Attention?

| Layer Type | Complexity per Layer | Sequential Ops | Max Path Length |
| --- | --- | --- | --- |
| Self-Attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k n)$ |

Self-attention connects all positions in $O(1)$ sequential operations, enabling full parallelization. It also provides a direct path between any two positions, making long-range dependencies easier to learn.

Trade-off: Self-attention has $O(n^2)$ memory complexity. For very long sequences, this can become prohibitive.

What Attention Heads Learn

The paper visualizes what individual attention heads learn in a trained Transformer. Different heads spontaneously specialize for different linguistic patterns—none of this structure is hard-coded.

Long-range attention dependency
Figure 3 from the paper: Self-attention following a long-distance dependency in encoder layer 5. The word "it" attends strongly to "The animal," resolving the coreference across the sentence.
Anaphora resolution head 1 Anaphora resolution head 2
Figure 4 from the paper: Two attention heads in layer 5 that appear to perform anaphora resolution—linking pronouns back to the nouns they refer to.
Sentence structure attention 1 Sentence structure attention 2
Figure 5 from the paper: Attention heads exhibiting behaviour related to sentence structure. Different heads learn to attend to different syntactic roles and positional patterns.

Training and Results

Setup: the models were trained on the WMT 2014 English-German (~4.5M sentence pairs) and English-French (~36M sentence pairs) translation tasks on 8 NVIDIA P100 GPUs, using the Adam optimizer with a warmup-then-decay learning-rate schedule, plus dropout and label smoothing; the big model trained for about 3.5 days.

| Model | EN-DE BLEU | EN-FR BLEU | Training Cost |
| --- | --- | --- | --- |
| Previous SOTA | 26.36 | 41.29 | $7.7 \times 10^{19}$ FLOPs |
| Transformer (big) | 28.4 | 41.8 | $2.3 \times 10^{19}$ FLOPs |

The Transformer achieves state-of-the-art results at a fraction of the training cost.

References

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017.

  2. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015.

  3. Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv:1607.06450.