Vision Transformers Need Registers

Why large ViTs develop mysterious high-norm tokens in background regions, and how adding simple register tokens fixes everything.

This article explains the ICLR 2024 paper Vision Transformers Need Registers by Darcet et al., which discovered a surprising phenomenon in large ViT models and proposed an elegantly simple fix.

The Mystery: Artifacts in Vision Transformers

Something strange happens in large Vision Transformers. When you visualize their attention maps or feature norms, you see scattered “spikes”—patches with abnormally high values appearing in seemingly random locations, mostly in uniform background regions.

Attention maps with and without registers
Figure 1 from the paper: Attention maps from DeiT-III, OpenCLIP, and DINOv2 models. Left column shows artifacts as bright spots scattered across images. Right column shows clean attention maps after adding registers.

These aren’t random glitches. They appear consistently across different training paradigms:

- DeiT-III: supervised classification
- OpenCLIP: text-supervised contrastive learning
- DINOv2: self-supervised learning

The only model that doesn’t show these artifacts? The original DINO. Understanding why reveals something fundamental about how Vision Transformers process information.

Characterizing the Artifacts

They Have Extremely High Norms

The artifacts correspond to tokens whose output feature vectors have roughly 10x higher norm than normal patches. When you plot the distribution of token norms across many images, you see a clear bimodal pattern:

Bimodal distribution of token norms
Figure 7 from the paper: Distribution of token norms across DINOv2, CLIP, and DeiT-III models. Without registers (left of each pair), a clear bimodal distribution shows ~2-3% of tokens with anomalously high norms. With registers (right), the distribution becomes unimodal.
Key observation: About 2-3% of tokens become "outliers" with norms exceeding 150, while normal tokens stay below 100.
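
This bimodal split makes outlier detection trivial: a single norm threshold separates the two modes. Below is a minimal sketch (synthetic features; the ~150 threshold comes from the paper's plots, and all shapes and scales here are illustrative, not from the paper's code):

```python
import numpy as np

# Toy sketch: flag "outlier" tokens by the norm of their output features.
# Normal tokens get per-dim std 3 (norm ~ 3*sqrt(1024) ~ 96, i.e. below 100);
# a small fraction are scaled 10x to mimic the high-norm artifacts.
rng = np.random.default_rng(0)

num_tokens, dim = 256, 1024
features = rng.normal(0.0, 3.0, size=(num_tokens, dim))  # normal tokens
features[:6] *= 10                                       # ~2-3% become outliers

norms = np.linalg.norm(features, axis=1)
outliers = norms > 150          # threshold sits between the two modes

print(f"outlier fraction: {outliers.mean():.1%}")  # → outlier fraction: 2.3%
```

With real ViT features the same thresholding reproduces the ~2-3% outlier rate the paper reports.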

They Appear in Low-Information Regions

Where do these high-norm tokens appear? Not randomly—they concentrate in patches that look similar to their neighbors. Areas of uniform color, texture, or background.

| Patch context | Token behavior |
| --- | --- |
| High similarity to neighboring patches | High-norm (outlier) token likely |
| Low similarity to neighboring patches | Normal token |

Outlier tokens appear where patches are redundant—conveying little unique information.
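
The "redundancy" diagnostic can be sketched as the average cosine similarity of each patch to its 4-connected neighbors. The grid size and the synthetic "uniform sky vs. textured region" split below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 8        # patch grid
dim = 32         # patch embedding dimension

patches = rng.normal(size=(H, W, dim))
patches[:4, :] = rng.normal(size=dim)   # top half: near-identical "sky" patches

# Unit-normalize so dot products are cosine similarities
unit = patches / np.linalg.norm(patches, axis=-1, keepdims=True)

def neighbor_similarity(u):
    """Mean cosine similarity of each patch to its up/down/left/right neighbors."""
    sims = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            nbrs = [(y + dy, x + dx) for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= y + dy < H and 0 <= x + dx < W]
            sims[y, x] = np.mean([u[y, x] @ u[ny, nx] for ny, nx in nbrs])
    return sims

sims = neighbor_similarity(unit)
# Uniform region scores near 1.0; textured region scores near 0.
```

High-scoring patches are exactly the ones the paper finds likely to turn into outlier tokens.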

They Emerge Mid-Network, Mid-Training, in Large Models

The artifacts don’t exist from the start. They develop under specific conditions:

Norms by layer Norms by training iteration Norms by model size
Figure 4 from the paper: (a) Norms spike around layer 15 of 40. (b) Artifacts appear only after ~1/3 of training. (c) Only ViT-Large and bigger models exhibit them.
| Condition | Artifacts present? |
| --- | --- |
| Early layers (1–10) | No |
| Middle layers (15+) | Yes |
| Early training (<33%) | No |
| Late training (>33%) | Yes |
| Small models (ViT-S/B) | No |
| Large models (ViT-L/H/g) | Yes |

This pattern suggests the artifacts are an emergent behavior—something the model learns to do when it has enough capacity and training time.

The Hypothesis: Recycled Tokens for Global Computation

Why would a model create these strange high-norm tokens? The paper proposes a compelling explanation:

Hypothesis: Large, well-trained ViTs learn to identify low-information patches and repurpose them as internal "registers" for storing and computing global image information.

Evidence: What Do Outlier Tokens Encode?

The authors train linear probes on individual tokens to test what information they contain:

- Local information (patch position, pixel reconstruction): outlier tokens predict their own position far worse than normal tokens.
- Global information (image classification): outlier tokens classify the whole image slightly better than normal tokens.

| Probe | Normal tokens | Outlier tokens |
| --- | --- | --- |
| Local info (position prediction accuracy) | 41.7% | 22.8% |
| Global info (classification accuracy) | 65.8% | 69.0% |

Outlier tokens discard local spatial information but retain (and slightly improve) global image understanding.

The outliers have discarded their local patch information and instead store global image features. They’re functioning as informal registers—but they’re doing it by hijacking patches that “shouldn’t matter.”
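
The probing logic can be sketched with a least-squares linear probe on synthetic features. Everything below is an illustrative construction, not the paper's data: "normal" features keep a positional code, while "outlier" features have it drowned out by a large non-positional component:

```python
import numpy as np

rng = np.random.default_rng(0)
n_images, n_pos, dim = 200, 16, 64

pos_code = rng.normal(size=(n_pos, dim))           # fixed embedding per grid position
positions = np.tile(np.arange(n_pos), n_images)    # ground-truth position labels

# Normal tokens: strong positional signal, small noise.
normal = pos_code[positions] + 0.5 * rng.normal(size=(len(positions), dim))
# Outlier tokens: positional signal shrunk, dominated by a large other component.
outlier = 0.1 * pos_code[positions] + 10.0 * rng.normal(size=(len(positions), dim))

def probe_accuracy(feats, labels):
    """Fit a least-squares linear probe to one-hot positions; argmax accuracy."""
    onehot = np.eye(n_pos)[labels]
    w, *_ = np.linalg.lstsq(feats, onehot, rcond=None)
    return ((feats @ w).argmax(axis=1) == labels).mean()

acc_normal = probe_accuracy(normal, positions)     # near-perfect
acc_outlier = probe_accuracy(outlier, positions)   # near chance (1/16)
```

The qualitative gap mirrors the paper's 41.7% vs. 22.8% position-probing result: once a token stops carrying local information, no linear readout can recover it.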

The Problem: Why This Matters

If the model works, why care about these artifacts?

1. Corrupted Feature Maps

Dense prediction tasks (segmentation, depth estimation, object detection) rely on spatially coherent feature maps. Artifacts introduce noise:

Attention-map comparison: input image, DINOv2 without registers (artifacts as bright scattered pixels in the background), and DINOv2 with registers (artifacts gone).

2. Broken Object Discovery

Methods like LOST (Localizing Objects with Self-supervised Transformers) use a ViT’s patch features to find objects without labels. Artifacts catastrophically break these methods for large models—which is why researchers were stuck using smaller, less capable models.

3. Uninterpretable Attention

Attention visualization is a key tool for understanding what models “see.” Artifacts make attention maps nearly useless for interpretation.

The Solution: Explicit Register Tokens

The fix is remarkably simple: give the model dedicated tokens for internal computation.

Register Token Formulation
$$\text{Input} = [\texttt{CLS}; \texttt{reg}_1; \ldots; \texttt{reg}_N; \texttt{patch}_1; \ldots; \texttt{patch}_M]$$
- [CLS]: classification token (as usual)
- [reg]: new learnable register tokens
- [patch]: image patch embeddings

How Registers Work

  1. Add N learnable tokens to the input sequence (after [CLS], before patches)
  2. Train normally—registers participate in all attention operations
  3. Discard registers at output—only use [CLS] and patch tokens for downstream tasks
Diagram: the input sequence [CLS] reg₁ … reg₄ p₁ … pₘ passes through the transformer layers; at the output, [CLS] and the patch tokens are used downstream while the register tokens are discarded.

Register tokens participate in attention but are discarded at output. They provide dedicated workspace for global computation.
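
The three steps above can be sketched in a few lines. This is a minimal shape-level illustration (identity stand-in for the transformer; token counts and dimensions are typical ViT values, not the paper's code):

```python
import numpy as np

num_patches, num_registers, dim = 196, 4, 768

cls_token = np.zeros((1, dim))                 # learnable in a real model
registers = np.zeros((num_registers, dim))     # learnable register tokens
patches = np.ones((num_patches, dim))          # patch embeddings

# 1. Assemble the input sequence: [CLS; reg_1..reg_N; patch_1..patch_M]
tokens = np.concatenate([cls_token, registers, patches], axis=0)

# 2. "Run" the transformer (identity stand-in; registers attend like any token)
out = tokens

# 3. Discard register outputs: only [CLS] and patch tokens go downstream
cls_out = out[0]
patch_out = out[1 + num_registers:]
```

The only architectural change is the concatenation in step 1 and the slice in step 3; training is otherwise unchanged.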

How Many Registers?

Performance vs number of registers Overhead vs number of registers
Figure 8 from the paper: Left: Performance on ImageNet, segmentation, and depth tasks vs. number of registers—4 registers is optimal. Right: Computational overhead is minimal (~2% FLOPs for 4 registers, negligible parameters).
| Registers | Artifacts | Performance | Overhead |
| --- | --- | --- | --- |
| 0 | Present | Baseline | 0% |
| 1 | Eliminated | Slight drop | ~0.5% |
| 4 | Eliminated | Optimal | <2% |
| 16 | Eliminated | Saturated | ~6% |

The sweet spot is 4 registers: artifacts completely gone, optimal downstream performance, and less than 2% computational overhead.
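
A back-of-envelope calculation shows why the overhead is so small. Assuming a ViT-L/14 at 224×224 input (256 patches plus one [CLS] token, which are assumptions for illustration, not numbers from the paper): the linear layers (MLP and attention projections, which dominate ViT FLOPs) scale linearly in sequence length, while the attention matmuls scale quadratically:

```python
base = 256 + 1            # patches + [CLS] for a ViT-L/14 at 224x224
registers = 4
seq = base + registers

linear_overhead = seq / base - 1             # extra cost of linear layers
attention_overhead = (seq / base) ** 2 - 1   # extra cost of attention matmuls

print(f"linear-term overhead:    {linear_overhead:.1%}")     # → 1.6%
print(f"attention-term overhead: {attention_overhead:.1%}")  # → 3.1%
```

Since the linear terms dominate total FLOPs, the overall overhead lands below ~2%, consistent with the paper’s measurement.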

Results: Registers Fix Everything

Artifact Elimination

Norm distribution before and after registers
Figure 7 from the paper: Distribution of output norms across all three training methods. With registers, the bimodal distribution becomes unimodal—artifacts are completely eliminated.

Dense Prediction Tasks

Performance on ImageNet classification, semantic segmentation (ADE20k), and depth estimation (NYUd), shown as without registers → with registers:

| Model | ImageNet (top-1) | ADE20k (mIoU ↑) | NYUd (RMSE ↓) |
| --- | --- | --- | --- |
| DeiT-III | 84.7 → 84.7 | 38.9 → 39.1 | 0.511 → 0.512 |
| OpenCLIP | 78.2 → 78.1 | 26.6 → 26.7 | 0.702 → 0.661 |
| DINOv2 | 84.3 → 84.8 | 46.6 → 47.9 | 0.378 → 0.366 |

Registers maintain or improve performance across the board. DINOv2 sees the largest gains.

Object Discovery Unlocked

The most dramatic improvement comes from object discovery methods like LOST:

| Model | VOC 2007 | VOC 2012 | COCO 20k |
| --- | --- | --- | --- |
| DeiT-III | 11.7 → 27.1 | 13.1 → 32.7 | 10.7 → 25.1 |
| DINOv2 | 35.3 → 55.4 | 40.2 → 60.0 | 26.9 → 42.0 |
+20 points on VOC 2007! Registers enable large models to work with object discovery methods that previously only worked with smaller models.

What Do Registers Learn?

Without any explicit supervision, registers spontaneously specialize:

Figure 9 from the paper: attention maps of the [CLS] token and individual registers (registers 0, 6, and 8 shown) on the same input image. Different registers develop distinct attention patterns—some focus on the object, others on edges or background regions.

Each register develops its own “role” in processing the image—some attend to central objects, others to boundaries, others to textures. The model figures out how to use this extra computational workspace on its own.


Why This Matters Beyond ViTs

This paper reveals something fundamental about how Transformers process information:

  1. Emergence of internal structure: Given enough capacity and training, models develop their own computational primitives—even without being told to.

  2. The cost of implicit computation: When models repurpose input tokens for computation, it corrupts the representational space. Explicit workspace is better.

  3. Simple fixes for complex problems: The solution isn’t architectural surgery—it’s just adding 4 tokens. Sometimes the best interventions are minimal.

Connection to LLMs: Similar phenomena have been observed in language models, where certain tokens become "sink" tokens for attention. The register concept may generalize beyond vision.

Takeaways

  1. Large Vision Transformers develop artifacts—high-norm tokens in low-information regions that serve as informal registers for global computation.

  2. Adding explicit register tokens eliminates these artifacts with negligible overhead (<2% FLOPs, ~0.1% parameters).

  3. Registers improve dense prediction tasks and unlock object discovery methods for large models.

  4. 4 registers is the sweet spot—enough to eliminate artifacts and optimize performance.

  5. Registers spontaneously specialize into different functional roles without supervision.

The paper demonstrates that understanding why neural networks develop certain behaviors—even strange ones—can lead to simple, principled improvements.

References

  1. Darcet, T., Oquab, M., Mairal, J., & Bojanowski, P. (2024). Vision Transformers Need Registers. ICLR 2024.

  2. Oquab, M., et al. (2023). DINOv2: Learning Robust Visual Features without Supervision.

  3. Touvron, H., et al. (2022). DeiT III: Revenge of the ViT.

  4. Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP).

  5. Siméoni, O., et al. (2021). Localizing Objects with Self-Supervised Transformers and no Labels (LOST).