Understanding the Transformer

Understanding the building blocks and design choices of the Transformer architecture.

This article explains the landmark paper Attention Is All You Need by Vaswani et al. (2017), which introduced the Transformer architecture that powers GPT, BERT, and nearly every modern language model.

Introduction

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism.

The Transformer is a new architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.

The Sequential Bottleneck

Before the Transformer, the dominant approach to sequence tasks—machine translation, language modeling, text generation—was the recurrent neural network (RNN), particularly LSTMs and GRUs.

RNNs process sequences one token at a time. To compute the hidden state at position $t$, you need the hidden state at position $t-1$:

$$h_t = f(h_{t-1}, x_t)$$

This creates two fundamental problems:

Lack of Parallelization

Because each step depends on the previous step, you cannot parallelize computation within a single sequence. Training is inherently sequential in time. For long sequences, this becomes a severe bottleneck.

Figure: RNN sequential processing. Tokens x₁…x₅ produce hidden states h₁…h₅, and each hidden state must wait for the previous one. RNNs process tokens sequentially; this dependency prevents parallel computation.
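The recurrence $h_t = f(h_{t-1}, x_t)$ can be sketched as an explicit loop over time steps. This is a minimal illustration with a simple tanh cell (the weight shapes and names here are illustrative, not from the paper):

```python
import numpy as np

def rnn_forward(x, W_h, W_x, b):
    """Run a simple tanh RNN over a sequence, one step at a time.

    Each h_t depends on h_{t-1}, so the loop over tokens cannot be
    parallelized across time steps.
    """
    d = W_h.shape[0]
    h = np.zeros(d)
    states = []
    for x_t in x:                          # inherently sequential
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
seq = rng.normal(size=(5, 3))              # 5 tokens, input dim 3
H = rnn_forward(seq,
                rng.normal(size=(4, 4)) * 0.1,   # hidden-to-hidden weights
                rng.normal(size=(4, 3)) * 0.1,   # input-to-hidden weights
                np.zeros(4))
print(H.shape)                             # (5, 4): one hidden state per token
```

The `for` loop is the bottleneck: no matter how many processors are available, step $t$ cannot begin until step $t-1$ finishes.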

Long-Range Dependencies

Information from early tokens must survive many sequential steps to influence later processing. Gradients must flow backward through all those steps. In practice, this makes learning long-range dependencies difficult, even with gating mechanisms like those in LSTMs and GRUs.

Key question: Can we design an architecture where every position can directly attend to every other position—without sequential dependencies?

The answer is the Transformer.

Attention as a Lookup

The core idea of attention is surprisingly simple: it’s a soft lookup into a set of values, where the lookup key determines how much weight to give each value.

Think of it like a soft database query: the attention mechanism compares your query to each key, computes a relevance score, and returns a weighted combination of the values rather than a single exact match.

Figure: a query ("What information is relevant here?") is compared against keys k₁…k₅ with associated values v₁…v₅, yielding scores 0.1, 0.7, 0.05, 0.1, 0.05 and the output 0.1·v₁ + 0.7·v₂ + 0.05·v₃ + 0.1·v₄ + 0.05·v₅. Attention computes a weighted sum of values, where the weights come from comparing a query to keys. Here, key k₂ matches the query best, so v₂ dominates the output.
Key insight: Attention connects any two positions in constant time. There's no sequential path that information must traverse.
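The soft lookup can be reproduced in a few lines of NumPy. This is a toy sketch with made-up keys and values, constructed so that the second key clearly matches the query:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Five key/value pairs and one query (toy numbers).
keys = np.array([[1.0, 0.0],
                 [0.0, 1.0],     # k2
                 [1.0, -1.0],
                 [-1.0, 0.0],
                 [0.0, -1.0]])
values = np.arange(10.0).reshape(5, 2)   # v1..v5, each 2-dimensional
query = np.array([0.0, 3.0])             # points in the direction of k2

weights = softmax(keys @ query)          # relevance of each key to the query
output = weights @ values                # weighted sum of values

print(weights.round(2))                  # k2 gets most of the mass
print(output)                            # output is dominated by v2
```

No sequential scan is involved: the query is compared to every key at once, and the result is a single weighted sum.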

Scaled Dot-Product Attention

The Transformer uses a specific form of attention called Scaled Dot-Product Attention.

Given a matrix of queries $Q \in \mathbb{R}^{n \times d_k}$, a matrix of keys $K \in \mathbb{R}^{m \times d_k}$, and a matrix of values $V \in \mathbb{R}^{m \times d_v}$, the attention output is:

Scaled Dot-Product Attention
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V$$
Query $Q$: what information does this position need?
Key $K$: what information does this position offer?
Value $V$: the actual content to retrieve
Scaling $\sqrt{d_k}$: prevents dot products from growing too large

Step by step

  1. Compute compatibility scores: $QK^T$ gives an $n \times m$ matrix of dot products. Entry $(i, j)$ measures how much query $i$ matches key $j$.

  2. Scale: Divide by $\sqrt{d_k}$. Without scaling, large $d_k$ values push dot products into regions where softmax has very small gradients.

  3. Normalize: Apply softmax row-wise. Each query now has a probability distribution over keys.

  4. Retrieve: Multiply by $V$. Each output is a weighted combination of values.
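The four steps above translate directly into NumPy. This is a minimal sketch of the formula (no masking, no batching, no learned projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # 1. compatibility, 2. scale
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # 3. row-wise softmax
    return weights @ V                              # 4. weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))    # 4 queries, d_k = 8
K = rng.normal(size=(6, 8))    # 6 keys
V = rng.normal(size=(6, 8))    # 6 values
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)               # (4, 8): one output row per query
```

Note that the entire computation is two matrix multiplications and a softmax, which is why it parallelizes so well on modern hardware.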

Why scale?

For large $d_k$, the dot products $q \cdot k$ tend to have large magnitude (variance roughly $d_k$ when the components have unit variance). This pushes the softmax into saturated regions where gradients vanish. Scaling by $\sqrt{d_k}$ brings the variance back to 1.
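The variance claim is easy to check empirically. Sampling query and key components with unit variance, the raw dot products have variance close to $d_k$, and dividing by $\sqrt{d_k}$ restores variance close to 1:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 512

# Unit-variance components: the dot product q.k sums d_k independent
# products, each with variance 1, so its variance is roughly d_k.
q = rng.normal(size=(10000, d_k))
k = rng.normal(size=(10000, d_k))

dots = (q * k).sum(axis=1)
print(dots.var())                    # close to d_k = 512
print((dots / np.sqrt(d_k)).var())   # close to 1 after scaling
```

Without the scaling, typical logits of magnitude $\sqrt{512} \approx 22$ would drive the softmax into near one-hot outputs with vanishing gradients.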

Scaled Dot-Product Attention
Figure 2 (left) from the paper: The Scaled Dot-Product Attention computation. Queries and Keys are dot-product scored, optionally masked, softmax-normalized, and used to weight the Values.

Multi-Head Attention

A single attention function can only focus on one type of relationship at a time. Multi-Head Attention runs multiple attention functions in parallel, each with its own learned projections.

Multi-Head Attention
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$$

$$\text{where } \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
$W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_k}$: learned projections
$W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$: output projection

Each head can learn to attend to different things: one head might track syntactic structure, another coreference, another nearby positions.

The paper uses $h = 8$ heads with $d_k = d_v = 64$ (for $d_{\text{model}} = 512$).
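With $h = 8$ and $d_k = d_v = 64$, the shapes work out as follows. This sketch wires up the projections and concatenation from the formula above (small random weights stand in for learned parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

d_model, h = 512, 8
d_k = d_model // h                       # 64, as in the paper

rng = np.random.default_rng(0)
X = rng.normal(size=(10, d_model))       # 10 tokens; self-attention: Q = K = V = X

# Per-head projections W_i^Q, W_i^K, W_i^V and the output projection W^O.
W_Q = rng.normal(size=(h, d_model, d_k)) * 0.02
W_K = rng.normal(size=(h, d_model, d_k)) * 0.02
W_V = rng.normal(size=(h, d_model, d_k)) * 0.02
W_O = rng.normal(size=(h * d_k, d_model)) * 0.02

# Each head attends in its own projected subspace.
heads = [attention(X @ W_Q[i], X @ W_K[i], X @ W_V[i]) for i in range(h)]
out = np.concatenate(heads, axis=-1) @ W_O
print(out.shape)                         # (10, 512): back to d_model
```

Because each head works in a $d_k = d_{\text{model}}/h$ subspace, the total cost is similar to one full-dimensional attention, yet the model gets $h$ independent views of the sequence.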

Multi-Head Attention
Figure 2 (right) from the paper: Multi-Head Attention runs multiple single-head attentions in parallel, each with its own learned projections, then concatenates and linearly projects the outputs.

The Transformer Architecture

The Transformer follows the encoder-decoder structure, but built entirely from attention and feed-forward layers.

Figure: the Transformer architecture (simplified). The encoder (left, ×6 layers) applies multi-head self-attention followed by a feed-forward network, each wrapped in Add & Norm, to the input embeddings plus positional encodings. The decoder (right, ×6 layers) applies masked multi-head self-attention, multi-head cross-attention over the encoder's keys and values, and a feed-forward network, each wrapped in Add & Norm, to the output embeddings (shifted right) plus positional encodings. The encoder processes the input sequence; the decoder generates the output, attending to both itself and the encoder output.
Transformer Architecture — original paper figure
Figure 1 from the paper: The full Transformer architecture as presented in Vaswani et al. The left side is the encoder stack; the right side is the decoder, which includes both masked self-attention and cross-attention to the encoder output.

Encoder

Each encoder layer has two sub-layers:

  1. Multi-head self-attention: Every position attends to every position
  2. Feed-forward network: Applied independently to each position
$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

Residual connections and layer normalization wrap each sub-layer.
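The sub-layer wiring — attention, then FFN, each wrapped in a residual connection and layer normalization — can be sketched as follows. The attention sub-layer is passed in as a callable (an identity stand-in below, just to show the wiring), and the small random FFN weights are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward: max(0, x W1 + b1) W2 + b2."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_layer(x, self_attn, ffn_params):
    # Sub-layer 1: self-attention, wrapped in residual + layer norm.
    x = layer_norm(x + self_attn(x))
    # Sub-layer 2: feed-forward, wrapped in residual + layer norm.
    x = layer_norm(x + ffn(x, *ffn_params))
    return x

d_model, d_ff, n = 512, 2048, 10         # dimensions from the paper
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_model))
params = (rng.normal(size=(d_model, d_ff)) * 0.02, np.zeros(d_ff),
          rng.normal(size=(d_ff, d_model)) * 0.02, np.zeros(d_model))

out = encoder_layer(X, self_attn=lambda x: x, ffn_params=params)
print(out.shape)                         # (10, 512): shape is preserved
```

The residual path gives gradients a direct route through the stack, and layer normalization keeps activations in a stable range across the six layers.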

Decoder

Each decoder layer has three sub-layers:

  1. Masked self-attention: Each position attends only to earlier positions
  2. Cross-attention: Queries from decoder; keys/values from encoder
  3. Feed-forward network: Same as encoder
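The masking in sub-layer 1 is implemented by adding $-\infty$ to the scores for all future positions before the softmax, which zeroes out their weights. A minimal sketch:

```python
import numpy as np

def causal_mask(n):
    """-inf above the diagonal: each position sees only itself and earlier ones."""
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1]) + causal_mask(Q.shape[0])
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)                   # exp(-inf) = 0: future positions ignored
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, w

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))      # 4 tokens, self-attention
out, w = masked_attention(Q, K, V)
print(np.round(w, 2))                    # upper triangle is all zeros
```

This is what makes the decoder autoregressive: during training all positions are computed in parallel, yet position $t$ can never peek at tokens after $t$.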

Positional Encoding

Self-attention is permutation-equivariant—it has no notion of position. The Transformer adds positional encodings to the input embeddings.

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$ $$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
Figure: positional encoding heatmap. Each row is a position, each column a dimension; sin fills the even dimensions, cos fills the odd ones. Lower dimensions vary quickly across positions; higher dimensions vary slowly.

For any fixed offset $k$, $PE_{pos+k}$ can be written as a linear function of $PE_{pos}$. This allows the model to learn to attend by relative position.
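Generating the sinusoidal encodings is a direct transcription of the two formulas above:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd ones."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model / 2)
    angles = pos / 10000 ** (2 * i / d_model)    # one frequency per dim pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

PE = positional_encoding(50, 512)
print(PE.shape)          # (50, 512)
print(PE[0, :4])         # position 0: alternating sin(0) = 0, cos(0) = 1
```

The sin/cos pairing is what enables the linear-function property: shifting a (sin, cos) pair by a fixed offset $k$ is a rotation, which is a linear map.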

Why Self-Attention?

| Layer Type | Complexity per Layer | Sequential Ops | Max Path Length |
| --- | --- | --- | --- |
| Self-Attention | $O(n^2 \cdot d)$ | $O(1)$ | $O(1)$ |
| Recurrent | $O(n \cdot d^2)$ | $O(n)$ | $O(n)$ |
| Convolutional | $O(k \cdot n \cdot d^2)$ | $O(1)$ | $O(\log_k n)$ |

Self-attention connects all positions in $O(1)$ sequential operations, enabling full parallelization. It also provides a direct path between any two positions, making long-range dependencies easier to learn.

Trade-off: Self-attention has $O(n^2)$ memory complexity. For very long sequences, this can become prohibitive.

What Attention Heads Learn

The paper visualizes what individual attention heads learn in a trained Transformer. Different heads spontaneously specialize for different linguistic patterns—none of this structure is hard-coded.

Long-range attention dependency
Figure 3 from the paper: Self-attention following a long-distance dependency in encoder layer 5. The word "it" attends strongly to "The animal," resolving the coreference across the sentence.
Anaphora resolution head 1 Anaphora resolution head 2
Figure 4 from the paper: Two attention heads in layer 5 that appear to perform anaphora resolution—linking pronouns back to the nouns they refer to.
Sentence structure attention 1 Sentence structure attention 2
Figure 5 from the paper: Attention heads exhibiting behaviour related to sentence structure. Different heads learn to attend to different syntactic roles and positional patterns.

Training and Results

Setup: the models were trained on the WMT 2014 English-German (~4.5M sentence pairs) and English-French (~36M sentence pairs) translation tasks on 8 NVIDIA P100 GPUs, using the Adam optimizer with a warmup-then-decay learning-rate schedule, plus dropout and label smoothing; the big model trained for about 3.5 days.

| Model | EN-DE BLEU | EN-FR BLEU | Training Cost |
| --- | --- | --- | --- |
| Previous SOTA | 26.36 | 41.29 | $7.7 \times 10^{19}$ FLOPs |
| Transformer (big) | 28.4 | 41.8 | $2.3 \times 10^{19}$ FLOPs |

The Transformer achieves state-of-the-art results at a fraction of the training cost.

References

  1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. NeurIPS 2017.

  2. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015.

  3. Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. arXiv:1607.06450.