Manifold Constrained Hyperconnections

This article explains mHC: Manifold-Constrained Hyper-Connections by Zhenda Xie, Yixuan Wei, Huanqi Cao et al. at DeepSeek (arXiv:2512.24880, December 2025). It traces a path from the residual connection (2015) through the layer normalization debate to ByteDance's Hyperconnections, then shows how DeepSeek uses convex geometry to make those connections trainable at scale.

Introduction

In January 2025, DeepSeek R1 briefly topped the App Store, ahead of ChatGPT and Gemini, despite being trained at a fraction of the cost. When DeepSeek published mHC at the end of that year, the question on everyone’s mind was whether this was the same kind of moment.

It isn’t a model release. It’s something quieter and, arguably, more interesting: a principled architectural improvement to how information flows between layers in a deep network. The key insight is geometric — the weight matrix governing that flow should live on the Birkhoff polytope, a beautiful object from combinatorics that ensures routing is always balanced.

To understand why that matters, we need to start at the beginning.

The Scaling Problem

The most natural way to make a neural network smarter is to make it deeper. More layers mean more computation, more abstraction, more capacity. But there’s a catch that plagued the field for years: simply stacking more layers often makes the model worse.

The culprit is the backpropagation chain rule. When you train a network, gradient signals flow backward from the loss through each layer. At every layer, the gradient is multiplied by that layer’s local Jacobian. For a network with $L$ layers:

$$\frac{\partial \mathcal{L}}{\partial h_0} = \frac{\partial \mathcal{L}}{\partial h_L} \prod_{l=1}^{L} \frac{\partial h_l}{\partial h_{l-1}}$$

This product of $L$ terms is the problem. If each term is slightly less than 1 — say, 0.9 — then after 50 layers the gradient is $0.9^{50} \approx 0.005$. After 100 layers: essentially zero. Gradients vanish, and early layers stop learning. Conversely, if each term is slightly greater than 1, gradients explode.
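The arithmetic is easy to check. The snippet below is a minimal sketch that treats every layer's Jacobian as a single scalar gain, which is a simplifying assumption; `chain_gain` is an illustrative helper name.

```python
def chain_gain(per_layer_gain: float, depth: int) -> float:
    """Gain of a chain of `depth` layers whose every factor equals per_layer_gain."""
    return per_layer_gain ** depth

for gain in (0.9, 1.1):
    for depth in (50, 100):
        print(f"gain={gain}, depth={depth}: {chain_gain(gain, depth):.3e}")
```

A per-layer gain of 0.9 leaves roughly 0.005 of the signal after 50 layers and about 0.00003 after 100, while a gain of 1.1 grows past 10,000 after 100 layers.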

Figure: Gradient signals flow backward through each layer. With no bypass, each multiplication step attenuates the signal, so early layers receive near-zero gradients and stop learning.

The depth dilemma: More layers mean more capacity, but the gradient chain grows multiplicatively weaker with depth. A 100-layer network trained naively often performs worse than a 20-layer one.

ResNet’s Answer — The Skip Connection

In 2015, He et al. solved this with a disarmingly simple idea: instead of hoping the gradient survives the entire chain, give it a shortcut. Add the input directly back to the transformed output.

Figure: A residual block. The input takes two paths: through the learned transformation $F$, and directly to the addition node. Gradients can always flow through the identity path.

Residual Connection

$$h_l = F(h_{l-1}) + h_{l-1}$$

$F(h_{l-1})$ — the learned transformation (FFN, attention, conv, …)

$h_{l-1}$ — the identity shortcut: input passes through unchanged

The gradient of the loss with respect to $h_{l-1}$ now has two terms: one through $F$ (which may vanish) and one that is simply 1 (the identity path). No matter how deep the network, the gradient never has to multiply through a long chain alone. This single change won the 2015 ImageNet competition and became the standard building block for nearly every deep network since.
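Differentiating the residual update makes the two terms explicit. The following is a short derivation, writing $I$ for the identity matrix:

$$\frac{\partial h_l}{\partial h_{l-1}} = \frac{\partial F(h_{l-1})}{\partial h_{l-1}} + I, \qquad \frac{\partial \mathcal{L}}{\partial h_0} = \frac{\partial \mathcal{L}}{\partial h_L} \prod_{l=1}^{L} \left( I + \frac{\partial F(h_{l-1})}{\partial h_{l-1}} \right)$$

Expanding the product gives a sum of terms, one of which is $\partial \mathcal{L} / \partial h_L$ multiplied only by identities. That term reaches $h_0$ undiminished even if every $\partial F / \partial h_{l-1}$ is close to zero.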

Where to Put Layer Norm

When the Transformer arrived in 2017, residual connections came along, but a new question emerged: where should you put Layer Normalization relative to the residual addition?

Figure: Two positions for Layer Normalization in a Transformer block. Post-LN places LayerNorm after the residual addition; Pre-LN places it before the sublayer. Each creates a different training pathology at scale.

The original Transformer placed LayerNorm after the residual addition (Post-LN). This made gradients near the input unstable, requiring careful learning rate warmup and limiting how deep you could go.

GPT-2 moved LayerNorm before the sublayer (Pre-LN). Training became far more stable. But a new problem surfaced: in very deep Pre-LN networks, the residual stream accumulates scale. Each layer adds to it, and the pure residual path eventually dominates. Later layers contribute marginally — their outputs are tiny relative to the running sum. The representations in different layers start looking nearly identical. This is representation collapse.
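The two placements differ by a single line of code. The sketch below is a toy illustration, not either model's implementation: `layer_norm` has no learned scale or bias, and `sublayer` is a hypothetical stand-in (one shared `tanh` layer) for attention or an FFN.

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Normalize the last axis to zero mean and unit variance (no learned parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    """Post-LN: normalize after the residual addition."""
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    """Pre-LN: normalize only the sublayer input; the residual stream is left untouched."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
d_model = 16
weight = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))
sublayer = lambda h: np.tanh(h @ weight)  # hypothetical stand-in for attention or an FFN

x_post = x_pre = rng.normal(size=(4, d_model))
for _ in range(32):
    x_post = post_ln_block(x_post, sublayer)
    x_pre = pre_ln_block(x_pre, sublayer)

print("Post-LN per-token std:", x_post.std(axis=-1))
print("Pre-LN  per-token std:", x_pre.std(axis=-1))
```

Post-LN renormalizes the stream at every block, so its scale is pinned. In Pre-LN nothing rescales the running sum, which is what lets the residual path come to dominate in very deep stacks.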

The tradeoff: Post-LN suffers gradient instability. Pre-LN suffers representation collapse. For years, every large language model had to pick a side.

Hyperconnections

In late 2024, researchers at ByteDance proposed a different approach. Instead of asking where to put LayerNorm within a fixed residual structure, they asked: what if you rethought the residual structure itself?

Their idea, called Hyperconnections (HC), replaces the single residual stream with $n$ parallel sub-streams. Each stream can receive contributions from all other streams, gated by a learned $n \times n$ weight matrix $W$:

Figure 1 from the paper. (a) Standard residual connection. (b) Hyper-Connections wrap each layer with three learned linear mappings (Res, Pre, Post; orange), enabling $n$ parallel streams to mix. (c) mHC applies a manifold projection $\mathcal{P}_\mathcal{M}$ (green) to each mapping, enforcing the doubly stochastic constraint.

The update for stream $i$ at layer $l$ is:

$$h_l^{(i)} = \sum_{j=1}^{n} W_{ij} \cdot F_j\!\left(h_{l-1}^{(j)}\right) + h_{l-1}^{(i)}$$

This allows different parts of the representation to specialize independently — bypassing the seesaw between Post-LN and Pre-LN. The idea works.

But there’s a fatal flaw.

The instability problem: In HC, the weight matrix $W$ is unconstrained. As layers stack, matrix multiplications compound. The paper measures the Amax Gain Magnitude — the maximum amplification across streams — and finds it peaks at 3,000 by layer 60. Training becomes numerically catastrophic.
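A toy calculation shows how quickly compounding gets out of hand. The sketch below assumes the gain is measured as the largest absolute row sum (forward) and column sum (backward) of the composite mapping, which matches how Figure 8 of the paper labels its axes. The 2×2 matrix is a made-up example, not a learned HC mapping, and `amax_gain` and `composite` are illustrative names.

```python
import numpy as np

def amax_gain(mapping: np.ndarray) -> tuple[float, float]:
    """Return (forward, backward) gain: largest absolute row sum and column sum."""
    forward = np.abs(mapping.sum(axis=1)).max()
    backward = np.abs(mapping.sum(axis=0)).max()
    return float(forward), float(backward)

def composite(mappings: list[np.ndarray]) -> np.ndarray:
    """Multiply per-layer mappings in order, as happens when layers are stacked."""
    product = np.eye(mappings[0].shape[0])
    for layer_mapping in mappings:
        product = layer_mapping @ product
    return product

# Hypothetical unconstrained mapping: every row and column sums to 1.2.
w_unconstrained = np.array([[1.1, 0.1],
                            [0.1, 1.1]])
forward, backward = amax_gain(composite([w_unconstrained] * 60))
print(f"forward gain after 60 layers: {forward:.1f}")
print(f"backward gain after 60 layers: {backward:.1f}")
```

A per-layer row sum of only 1.2 compounds to $1.2^{60}$, roughly 56,000, after 60 layers. Nothing in unconstrained HC prevents row sums from drifting above 1.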

The Birkhoff Polytope

DeepSeek’s mHC keeps everything that makes Hyperconnections powerful, but adds one constraint: $W$ must be a doubly stochastic matrix.

Doubly Stochastic Matrices

A matrix $W \in \mathbb{R}^{n \times n}$ is doubly stochastic if all entries are non-negative and every row and every column sums to exactly 1:

Doubly Stochastic Constraint

$$W_{ij} \geq 0, \quad \sum_{j=1}^{n} W_{ij} = 1 \;\forall i, \quad \sum_{i=1}^{n} W_{ij} = 1 \;\forall j$$

Row sum = 1: stream $i$ distributes its total weight across all outputs — no signal amplification

Column sum = 1: each output stream receives exactly one unit of total input — no signal accumulation

Here’s a concrete $3 \times 3$ example:

	out 1	out 2	out 3	Σ row
in 1	0.75	0.14	0.11	1.00
in 2	0.10	0.72	0.18	1.00
in 3	0.15	0.14	0.71	1.00
Σ col	1.00	1.00	1.00	—

Intuitively, a doubly stochastic matrix is a conservative router: no stream receives more total weight than it sends, and vice versa. Information is redistributed, not amplified.
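The example matrix can be checked numerically. This is a minimal sketch using NumPy; `is_doubly_stochastic` is an illustrative helper, and the `signal` vector is arbitrary.

```python
import numpy as np

W = np.array([[0.75, 0.14, 0.11],
              [0.10, 0.72, 0.18],
              [0.15, 0.14, 0.71]])

def is_doubly_stochastic(matrix: np.ndarray, tol: float = 1e-9) -> bool:
    """True if entries are non-negative and all row and column sums equal 1 within tol."""
    non_negative = bool((matrix >= 0).all())
    rows_ok = bool(np.allclose(matrix.sum(axis=1), 1.0, atol=tol))
    cols_ok = bool(np.allclose(matrix.sum(axis=0), 1.0, atol=tol))
    return non_negative and rows_ok and cols_ok

print(is_doubly_stochastic(W))           # True
signal = np.array([3.0, -1.0, 2.0])      # arbitrary per-stream values
print(signal.sum(), (W @ signal).sum())  # both totals equal 4 up to rounding
```

Because every column sums to 1, the total across streams is the same before and after mixing: the matrix moves signal between streams without creating or destroying any.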

The Birkhoff Polytope

The set of all doubly stochastic matrices forms a convex polytope, called the Birkhoff polytope $\mathcal{B}_n$. Its vertices are exactly the $n!$ permutation matrices: the “pure routes” where each input stream is sent to exactly one output stream and no two inputs share an output. Everything inside is a convex mixture of these pure routes.

Birkhoff-von Neumann theorem (1946): Every doubly stochastic matrix is a convex combination of permutation matrices. Equivalently, the vertices of $\mathcal{B}_n$ are exactly the $n!$ permutation matrices. This is a foundational result in combinatorial optimization, and it gives the polytope a clean geometric interpretation.
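The easy direction of the theorem can be seen by construction: any convex mixture of permutation matrices is doubly stochastic. The sketch below builds one for $n = 3$; the mixture weights are arbitrary illustrative values, and `permutation_matrix` is a hypothetical helper name.

```python
import itertools
import numpy as np

def permutation_matrix(perm: tuple[int, ...]) -> np.ndarray:
    """Matrix that routes input stream j to output stream perm[j]."""
    n = len(perm)
    matrix = np.zeros((n, n))
    matrix[np.array(perm), np.arange(n)] = 1.0
    return matrix

n = 3
vertices = [permutation_matrix(p) for p in itertools.permutations(range(n))]  # n! = 6

# Illustrative mixture weights: non-negative and summing to 1.
weights = np.array([0.5, 0.1, 0.1, 0.1, 0.1, 0.1])
W = sum(w * P for w, P in zip(weights, vertices))

print(W)
print(W.sum(axis=0), W.sum(axis=1))  # every row and column sums to 1
```

The harder direction, that every doubly stochastic matrix can be decomposed this way, is the content of the theorem itself.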

Why This Stabilizes Training

The key property is the spectral norm. For any $W \in \mathcal{B}_n$, its spectral norm (largest singular value) satisfies:

$$\|W\|_2 \leq 1$$

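This follows from a standard norm inequality: $\|W\|_2 \leq \sqrt{\|W\|_1 \, \|W\|_\infty}$, where $\|W\|_1$ is the largest absolute column sum and $\|W\|_\infty$ the largest absolute row sum. Both equal 1 for a doubly stochastic matrix, so the bound is 1, and it is attained because $W$ maps the all-ones vector to itself. (Eigenvalues alone would not be enough: they bound the spectral radius, not the spectral norm.) The product of two doubly stochastic matrices is again doubly stochastic, so the same bound holds for the composite mapping across any number of layers. When you multiply by $W$ at every layer, the signal cannot grow. The 3,000× amplitude spike that broke vanilla Hyperconnections cannot happen inside $\mathcal{B}_n$.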

The geometric fix: Constraining $W$ to the Birkhoff polytope makes it a spectral contraction ($\|W\|_2 \leq 1$). Stack as many layers as you want — the routing matrix can never amplify signals. Stability is a mathematical guarantee, not a tuning choice.

The contrast in practice is striking. Figure 8 from the paper shows the actual learned weight matrices at individual layers and their cumulative product across 60 layers:

Figure 8 from the paper. Each matrix is averaged over all tokens in a selected sequence. Row labels (y-axis) show forward signal gain (row sum); column labels (x-axis) show backward gradient gain (column sum). HC's residual mapping has row sums reaching ±18 at layer 1, and after 60 layers the cumulative product explodes to ±265. mHC's projected mapping is doubly stochastic at every layer (all sums ≈ 1.00), and the 60-layer composite converges to a stable, near-uniform distribution.
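The same qualitative behavior appears in a toy setting. The sketch below reuses the $3 \times 3$ example matrix from earlier and applies it at 60 consecutive layers. Using one fixed matrix for every layer is a simplifying assumption; the learned mappings in the paper differ per layer and per token.

```python
import numpy as np

W = np.array([[0.75, 0.14, 0.11],
              [0.10, 0.72, 0.18],
              [0.15, 0.14, 0.71]])

spectral_norm = np.linalg.norm(W, ord=2)    # largest singular value
composite = np.linalg.matrix_power(W, 60)   # the same mapping applied at 60 layers

print(f"spectral norm of W: {spectral_norm:.6f}")
print("row sums of W^60:", composite.sum(axis=1))
print("col sums of W^60:", composite.sum(axis=0))
print("largest entry gap from 1/3:", np.abs(composite - 1 / 3).max())
```

The composite is still doubly stochastic and its entries settle near $1/3$. Compare this with an unconstrained mapping, whose row sums can compound geometrically with depth.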

The Sinkhorn-Knopp Algorithm

The constraint $W \in \mathcal{B}_n$ is elegant, but how do you enforce it during gradient descent? You can’t simply clamp entries — that breaks the row and column sums simultaneously.

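The answer is the Sinkhorn-Knopp algorithm (1967). It needs a matrix with positive entries, so the unconstrained weights are first passed through an element-wise exponential, as in the paper. The result is then mapped onto the Birkhoff polytope by alternating two normalizations: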

Sinkhorn-Knopp: Row Step

$$A_{ij} \leftarrow \frac{A_{ij}}{\displaystyle\sum_{k=1}^{n} A_{ik}}$$

Divide each entry by its row sum — all rows now sum to 1

Sinkhorn-Knopp: Column Step

$$A_{ij} \leftarrow \frac{A_{ij}}{\displaystyle\sum_{k=1}^{n} A_{kj}}$$

Divide each entry by its column sum — all columns now sum to 1

Repeat alternating row → column until convergence (typically 3–10 iterations)

Each step is a row or column sum followed by an element-wise division. For an input with strictly positive entries, the algorithm converges to a unique doubly stochastic matrix, the one closest to the input in KL divergence. In practice, 3–10 alternations are typically enough to bring the maximum row/column deviation from 1 below $10^{-3}$ for small, well-conditioned matrices.
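The row and column steps translate almost directly into NumPy. This is a minimal sketch rather than DeepSeek's implementation: `sinkhorn_knopp` and `max_deviation` are illustrative names, the input matrix `raw` is made up, and the default of 20 iterations simply mirrors the cap used in the paper's experiments.

```python
import numpy as np

def sinkhorn_knopp(matrix: np.ndarray, n_iters: int = 20) -> np.ndarray:
    """Alternate row and column normalization on a strictly positive square matrix."""
    if (matrix <= 0).any():
        raise ValueError("this sketch requires strictly positive entries")
    result = matrix.astype(float)
    for _ in range(n_iters):
        result = result / result.sum(axis=1, keepdims=True)  # row step
        result = result / result.sum(axis=0, keepdims=True)  # column step
    return result

def max_deviation(matrix: np.ndarray) -> float:
    """Largest distance of any row or column sum from 1."""
    row_dev = np.abs(matrix.sum(axis=1) - 1.0).max()
    col_dev = np.abs(matrix.sum(axis=0) - 1.0).max()
    return float(max(row_dev, col_dev))

# Made-up unconstrained weights; exp makes every entry positive.
raw = np.array([[0.5, -1.0, 0.2],
                [1.5, 0.3, -0.7],
                [-0.2, 0.8, 1.0]])
for iters in (1, 3, 10, 20):
    projected = sinkhorn_knopp(np.exp(raw), n_iters=iters)
    print(iters, max_deviation(projected))
```

Because the loop ends on a column step, the column sums are exact up to rounding and the remaining error sits in the row sums. The printed deviation shows how quickly it shrinks for this particular input.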

The mHC Formula

Putting it all together, the mHC forward pass is:

mHC Forward Pass

$$\tilde{h}_l^{(i)} = \sum_{j=1}^{n} W_{ij} \cdot h_{l-1}^{(j)}, \qquad W \in \mathcal{B}_n$$

$$h_l^{(i)} = F_i\!\left(\tilde{h}_l^{(i)}\right) + \tilde{h}_l^{(i)}$$

$W \in \mathcal{B}_n$: the manifold constraint, enforced via Sinkhorn-Knopp

$\tilde{h}_l^{(i)}$ — the mixed input to stream $i$: a doubly stochastic blend of all streams from the previous layer

$F_i$ — the learned transformation for stream $i$ (attention, FFN, etc.)

Residual — the standard skip connection is preserved within each stream

The difference from vanilla Hyperconnections is exactly one constraint: $W \in \mathcal{B}_n$. The forward computation is otherwise unchanged. The only extra work is running Sinkhorn-Knopp on an $n \times n$ matrix, a handful of element-wise divisions that are small next to the FFN or attention compute.
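The two-line formula can be written out as a short NumPy sketch. It follows the article's simplified update, not the paper's full parameterization. `mhc_forward` is an illustrative name, the per-stream `transforms` are hypothetical stand-ins for attention or FFN sublayers, and `W` is assumed to be doubly stochastic already.

```python
import numpy as np

def mhc_forward(streams: np.ndarray, W: np.ndarray, transforms) -> np.ndarray:
    """One layer of the simplified mHC update.

    streams:    array of shape (n, d), one row per stream
    W:          doubly stochastic (n, n) mixing matrix
    transforms: list of n callables, the F_i in the formula
    """
    mixed = W @ streams  # h_tilde: doubly stochastic blend of all streams
    updated = [f(mixed[i]) + mixed[i] for i, f in enumerate(transforms)]
    return np.stack(updated)

n, d = 3, 4
rng = np.random.default_rng(0)
W = np.array([[0.75, 0.14, 0.11],
              [0.10, 0.72, 0.18],
              [0.15, 0.14, 0.71]])
# Hypothetical per-stream transforms standing in for attention or FFN sublayers.
stream_weights = [rng.normal(scale=0.1, size=(d, d)) for _ in range(n)]
transforms = [lambda h, A=A: np.tanh(h @ A) for A in stream_weights]

streams = rng.normal(size=(n, d))
out = mhc_forward(streams, W, transforms)
print(out.shape)  # (3, 4)
```

With the transforms switched off, the mixing step alone preserves the sum over streams, which is the conservation property the column-sum constraint provides.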

Training Dynamics and Results

DeepSeek evaluated mHC at 3B, 9B, and 27B parameter scales using a MoE architecture based on DeepSeek-V3, with expansion rate $n = 4$ and Sinkhorn-Knopp iterations capped at $t_\text{max} = 20$.

Training stability (27B model). Figure 5 from the paper shows loss gap and gradient norm over 50K training steps:

Figure 5 from the paper. Training stability of the 27B model. (Left) Absolute training loss gap relative to the Pre-LN baseline: mHC achieves a final loss reduction of 0.021. (Right) Gradient norm over training: HC exhibits large erratic spikes throughout; mHC maintains a stable profile comparable to the baseline.

Benchmark results (27B model, same token budget):

Benchmark	Baseline	HC	mHC
BBH (EM)	43.8	48.9	51.0
DROP (F1)	47.0	51.6	53.9
GSM8K (EM)	46.7	53.2	53.8
HellaSwag (Acc.)	73.7	74.3	74.7
MATH (EM)	22.0	26.4	26.0
MMLU (Acc.)	59.0	63.0	63.4
PIQA (Acc.)	78.5	79.9	80.5
TriviaQA (EM)	54.3	56.3	57.6

Same tokens, better results. mHC outperforms the Pre-LN baseline on every benchmark, and HC on every benchmark except MATH, on the same token budget. The paper reports only 6.7% additional training time at $n = 4$, a small price for consistent gains.
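The per-benchmark margins are easy to tabulate. The snippet below only re-computes differences from the scores in the table above; it introduces no new data.

```python
# Scores copied from the 27B benchmark table: (baseline, HC, mHC).
scores = {
    "BBH": (43.8, 48.9, 51.0),
    "DROP": (47.0, 51.6, 53.9),
    "GSM8K": (46.7, 53.2, 53.8),
    "HellaSwag": (73.7, 74.3, 74.7),
    "MATH": (22.0, 26.4, 26.0),
    "MMLU": (59.0, 63.0, 63.4),
    "PIQA": (78.5, 79.9, 80.5),
    "TriviaQA": (54.3, 56.3, 57.6),
}

for name, (baseline, hc, mhc) in scores.items():
    print(f"{name:10s} mHC-baseline={mhc - baseline:+.1f}  mHC-HC={mhc - hc:+.1f}")

mean_gain = sum(mhc - baseline for baseline, _, mhc in scores.values()) / len(scores)
hc_wins = [name for name, (_, hc, mhc) in scores.items() if hc > mhc]
print(f"mean gain over baseline: {mean_gain:.2f} points; HC ahead on: {hc_wins}")
```

The gains over the baseline are largest on the reasoning-heavy benchmarks (BBH, DROP, GSM8K) and smallest on HellaSwag and PIQA.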

Key Takeaways

1. Depth needs highways. Residual connections are not an optional convenience — they’re what makes deep networks trainable. Without a bypass path, gradients vanish and early layers stop learning.

2. Hyperconnections generalize the residual. Three learned mappings (Res, Pre, Post) around each layer allow $n$ parallel streams to mix information flexibly, escaping the Post-LN / Pre-LN tradeoff. The idea is sound.

3. The Birkhoff polytope is the right constraint. Doubly stochastic matrices cannot amplify the signal ($\|W\|_2 \leq 1$) by construction. This turns a beautiful object from combinatorics into a practical stability guarantee for deep learning.

4. Sinkhorn-Knopp is the projector. Two alternating normalizations, row then column, converge to the unique doubly stochastic matrix closest in KL divergence to the positive input matrix. Each iteration costs only a few element-wise divisions on an $n \times n$ matrix.

5. mHC is a drop-in improvement. The only change relative to a standard Pre-LN Transformer is replacing the single residual stream with $n$ mHC streams and constraining $W$ to the Birkhoff polytope with Sinkhorn-Knopp. The rest of the architecture and the training recipe stay the same.

References

Zhenda Xie et al. mHC: Manifold-Constrained Hyper-Connections. DeepSeek, arXiv:2512.24880, 2025.
Kaiming He et al. Deep Residual Learning for Image Recognition. CVPR 2016.
Ashish Vaswani et al. Attention Is All You Need. NeurIPS 2017.
Ruibin Xiong et al. On Layer Normalization in the Transformer Architecture. ICML 2020.
Defa Zhu et al. Hyper-Connections. arXiv 2024. (ByteDance)
Richard Sinkhorn & Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 1967.
Garrett Birkhoff. Tres observaciones sobre el álgebra lineal. Universidad Nacional de Tucumán, 1946.