[paper] mHC: Manifold-Constrained Hyper-Connections

This post is based on Manifold-Constrained Hyper-Connections ¹ and the Megatron-LM integration tracked in issue #2919, PyTorch-path PR #2943, and cuTile kernel-fusion PR #3828. DeepSeek-V4, released today (2026/04/24), also uses this hyper-connection structure, as expected.

This post focuses on the computation flow and the dedicated optimizations rather than on the theoretical analysis.

Parameterization and manifold projection

Residual connections, introduced in ResNet ², let each layer learn an additive update $\mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}(\mathbf{x}_l, \mathbf{W}_l)$ instead of a full transformation, which stabilizes gradient flow and enables the training of very deep networks. The single residual stream, however, ties every sublayer to the same width-$C$ representation; HC and mHC generalize this skip to $n$ parallel streams with learnable, possibly manifold-constrained mixing.

Hyper-Connections (HC) ³ widen the residual from a single stream of width $C$ to $n$ parallel streams with learnable mixing:

\mathbf{x}_{l+1} \;=\; \mathbf{H}_l^{\mathrm{res}}\, \mathbf{x}_l \;+\; (\mathbf{H}_l^{\mathrm{post}})^\top\, \mathcal{F}(\mathbf{H}_l^{\mathrm{pre}}\, \mathbf{x}_l,\; \mathbf{W}_l)

with $\mathbf{H}_l^{\mathrm{res}} \in \mathbb{R}^{n \times n}$ and $\mathbf{H}_l^{\mathrm{pre}}, \mathbf{H}_l^{\mathrm{post}} \in \mathbb{R}^{1 \times n}$. mHC parameterizes these three mappings so that each is projected onto a well-behaved manifold, keeping the composite stable across depth. Given the input hidden matrix $\mathbf{x}_l \in \mathbb{R}^{n \times C}$ at layer $l$, the computation proceeds in two phases.

Phase 1: initial mapping computation

The flattened input $\vec{\mathbf{x}}_l = \mathrm{vec}(\mathbf{x}_l) \in \mathbb{R}^{1 \times nC}$ preserves full cross-stream context. After an RMSNorm, the dynamic and static mappings are computed as:

\begin{aligned} \vec{\mathbf{x}}_l' &= \mathrm{RMSNorm}(\vec{\mathbf{x}}_l) \\ \tilde{\mathbf{H}}_l^{\mathrm{pre}} &= \alpha_l^{\mathrm{pre}} \cdot \big(\vec{\mathbf{x}}_l'\, \boldsymbol{\phi}_l^{\mathrm{pre}}\big) + \mathbf{b}_l^{\mathrm{pre}} \\ \tilde{\mathbf{H}}_l^{\mathrm{post}} &= \alpha_l^{\mathrm{post}} \cdot \big(\vec{\mathbf{x}}_l'\, \boldsymbol{\phi}_l^{\mathrm{post}}\big) + \mathbf{b}_l^{\mathrm{post}} \\ \tilde{\mathbf{H}}_l^{\mathrm{res}} &= \alpha_l^{\mathrm{res}} \cdot \mathrm{mat}\big(\vec{\mathbf{x}}_l'\, \boldsymbol{\phi}_l^{\mathrm{res}}\big) + \mathbf{b}_l^{\mathrm{res}} \end{aligned}

where $\boldsymbol{\phi}_l^{\mathrm{pre}}, \boldsymbol{\phi}_l^{\mathrm{post}} \in \mathbb{R}^{nC \times n}$ and $\boldsymbol{\phi}_l^{\mathrm{res}} \in \mathbb{R}^{nC \times n^2}$ are learnable linear projections, and $\mathrm{mat}(\cdot)$ reshapes the output from $\mathbb{R}^{1 \times n^2}$ back to $\mathbb{R}^{n \times n}$. In practice the three projections are packed into a single $nC \to n^2 + 2n$ linear. $\boldsymbol{\phi}$, $\mathbf{b}$ and $\alpha$ are all learnable parameters.
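The packed projection can be sketched in NumPy as follows. This is a minimal illustration rather than the Megatron code: the function names, the packing order (pre, post, res), the gain-free RMSNorm with its `eps`, and the row-major `mat` reshape are all assumptions made for the example.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Gain-free RMSNorm over the last axis (eps is an assumed value).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def raw_mappings(x, phi, bias, alpha):
    """Compute the unconstrained mappings for a batch of tokens.

    x:     [tokens, n, C]      n-stream hidden states
    phi:   [n*C, n*n + 2*n]    packed projection (assumed order: pre, post, res)
    bias:  [n*n + 2*n]         packed static biases, same order
    alpha: dict with the scalar gates "pre", "post", "res"
    """
    tokens, n, C = x.shape
    flat = rms_norm(x.reshape(tokens, n * C))        # vec(x_l), then RMSNorm
    proj = flat @ phi                                # one GEMM for all three mappings
    p_pre, p_post, p_res = np.split(proj, [n, 2 * n], axis=-1)
    b_pre, b_post, b_res = np.split(bias, [n, 2 * n])
    h_pre = alpha["pre"] * p_pre + b_pre             # [tokens, n]
    h_post = alpha["post"] * p_post + b_post         # [tokens, n]
    h_res = alpha["res"] * p_res.reshape(tokens, n, n) + b_res.reshape(n, n)
    return h_pre, h_post, h_res

rng = np.random.default_rng(0)
tokens, n, C = 5, 4, 8
x = rng.normal(size=(tokens, n, C))
phi = rng.normal(size=(n * C, n * n + 2 * n)) * 0.02
bias = rng.normal(size=(n * n + 2 * n,))
alpha = {"pre": 0.1, "post": 0.1, "res": 0.1}
h_pre, h_post, h_res = raw_mappings(x, phi, bias, alpha)
print(h_pre.shape, h_post.shape, h_res.shape)  # (5, 4) (5, 4) (5, 4, 4)
```

Packing the three projections means the wide $nC$-dimensional input is read once instead of three times; the split afterwards only touches $n^2 + 2n$ values per token.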

Phase 2: manifold projection

Each raw mapping is then pushed onto its target manifold:

\mathbf{H}_l^{\mathrm{pre}} = \sigma(\tilde{\mathbf{H}}_l^{\mathrm{pre}}), \quad \mathbf{H}_l^{\mathrm{post}} = 2\sigma(\tilde{\mathbf{H}}_l^{\mathrm{post}}), \quad \mathbf{H}_l^{\mathrm{res}} = \mathrm{SK}(\tilde{\mathbf{H}}_l^{\mathrm{res}})

where $\sigma(\cdot)$ is the sigmoid and $\mathrm{SK}(\cdot)$ denotes Sinkhorn-Knopp. The non-negativity enforced on $\mathbf{H}_l^{\mathrm{pre}}$ and $\mathbf{H}_l^{\mathrm{post}}$ prevents signal cancellation from positive-negative coefficient composition, acting as a supplementary manifold projection alongside the Birkhoff-polytope constraint on $\mathbf{H}_l^{\mathrm{res}}$.

Intuition: why can a manifold constraint help numerical stability?

Think of the $n$ residual streams as buckets that carry signal across depth, and each $\mathbf{H}_l^{*}$ as a per-layer rule for how to refill them. Without constraints, that rule can both scale a bucket up and flip its sign, so a tiny perturbation compounds multiplicatively layer over layer, like an interest rate free to be negative or larger than one. The manifold projections cap exactly the two failure modes:

The sigmoid on $\mathbf{H}_l^{\mathrm{pre}}$ and $2\sigma$ on $\mathbf{H}_l^{\mathrm{post}}$ enforce non-negativity, so signals can only be attenuated or summed, never sign-cancelled.
Sinkhorn-Knopp makes $\mathbf{H}_l^{\mathrm{res}}$ doubly stochastic, so the mix becomes a redistribution of mass across buckets rather than a rescaling of the total.

Together these projections turn the deep-network update into a bounded, sign-preserving, roughly mass-conserving map — the regime where pre-norm residual networks are known to train stably across hundreds of layers.
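A small numerical experiment illustrates the intuition above. It is a toy sketch, not taken from the paper: the unconstrained mixing matrices (identity plus Gaussian noise), the depth, and the way doubly stochastic matrices are sampled (convex combinations of permutation matrices, following the Birkhoff-von Neumann theorem) are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 4, 64

def random_doubly_stochastic(rng, n):
    # A convex combination of permutation matrices is doubly stochastic.
    weights = rng.dirichlet(np.ones(n))
    perms = [np.eye(n)[rng.permutation(n)] for _ in range(n)]
    return sum(w * p for w, p in zip(weights, perms))

x = rng.normal(size=n)          # one channel, viewed across the n streams
free = x.copy()
constrained = x.copy()
for _ in range(depth):
    # Unconstrained mix: entries may be negative or larger than one.
    free = (np.eye(n) + 0.5 * rng.normal(size=(n, n))) @ free
    # Doubly stochastic mix: a redistribution of mass across streams.
    constrained = random_doubly_stochastic(rng, n) @ constrained

print("input norm        :", np.linalg.norm(x))
print("unconstrained norm:", np.linalg.norm(free))
print("constrained norm  :", np.linalg.norm(constrained))
print("constrained sum vs input sum:", constrained.sum(), x.sum())
```

Because every column of a doubly stochastic matrix sums to one, the total across streams is preserved at every layer, and because its spectral norm is one, the norm can never grow, whatever the depth.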

Forward compute workflow

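Each transformer layer applies the mHC pattern twice: once around self-attention, once around the MLP. With $s$ tokens, batch $b$, expansion $n$, and hidden size $C$, each sublayer computes the three mappings from the $[s,b,n,C]$ residual, aggregates the $n$ streams into a single width-$C$ input with $\mathbf{H}^{\mathrm{pre}}$, runs the sublayer, expands its output back to $n$ streams with $\mathbf{H}^{\mathrm{post}}$, and adds the residual mixed by $\mathbf{H}^{\mathrm{res}}$.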

The block entry replicates the single input stream $n$ times, and the block exit averages the $n$ streams back.
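The aggregate, expand, and mix steps, together with the block entry and exit, can be sketched with `einsum`. The function names and the toy mapping values are assumptions for illustration; the mappings are taken as given inputs here rather than computed from the hidden states.

```python
import numpy as np

def block_entry(h, n):
    # [s, b, C] -> [s, b, n, C]: replicate the single stream n times.
    return np.repeat(h[:, :, None, :], n, axis=2)

def block_exit(x):
    # [s, b, n, C] -> [s, b, C]: average the n streams back.
    return x.mean(axis=2)

def mhc_sublayer(x, h_pre, h_post, h_res, sublayer):
    """x: [s, b, n, C]; h_pre, h_post: [s, b, n]; h_res: [s, b, n, n]."""
    agg = np.einsum("sbn,sbnc->sbc", h_pre, x)            # aggregate n -> 1
    out = sublayer(agg)                                    # attention or MLP on width C
    expanded = np.einsum("sbn,sbc->sbnc", h_post, out)     # expand 1 -> n
    mixed = np.einsum("sbnm,sbmc->sbnc", h_res, x)         # mix the residual streams
    return mixed + expanded

s, b, n, C = 3, 2, 4, 8
rng = np.random.default_rng(0)
h = rng.normal(size=(s, b, C))
x = block_entry(h, n)

# Degenerate mappings that recover the ordinary residual connection:
h_pre = np.full((s, b, n), 1.0 / n)                 # average the streams
h_post = np.ones((s, b, n))                         # write the output to every stream
h_res = np.broadcast_to(np.eye(n), (s, b, n, n))    # no cross-stream mixing
y = block_exit(mhc_sublayer(x, h_pre, h_post, h_res, np.tanh))
print(np.allclose(y, h + np.tanh(h)))  # True
```

With these degenerate mappings the $n$-stream update collapses to $\mathbf{x} + \mathcal{F}(\mathbf{x})$, which is a convenient sanity check when wiring the shapes.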

Implementation details of Sinkhorn-Knopp

mHC introduces some extra learnable parameters:

NVIDIA/Megatron-LM, megatron/core/transformer/hyper_connection.py, lines 154 to 161 at commit 2e55168

The Sinkhorn-Knopp operator itself enforces double stochasticity through iterative normalization. Starting from the positive matrix $\mathbf{M}^{(0)} = \exp(\tilde{\mathbf{H}}_l^{\mathrm{res}})$, it alternates row and column normalization:

\mathbf{M}^{(t)} = \mathcal{T}_r\big(\mathcal{T}_c(\mathbf{M}^{(t-1)})\big)

where $\mathcal{T}_r$ and $\mathcal{T}_c$ divide each row or column by its sum. The sequence converges to a doubly-stochastic $\mathbf{H}_l^{\mathrm{res}} = \mathbf{M}^{(t_{\max})}$ as $t_{\max} \to \infty$; the implementation uses $t_{\max} = 20$, with row-max subtraction before the exponential for numerical stability.
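The iteration can be written in a few lines of NumPy. This is a sketch under stated assumptions, not the Megatron implementation: the function name is hypothetical, and the `eps` added to each denominator is an assumed safeguard whose value here is only illustrative.

```python
import numpy as np

def sinkhorn_knopp(h_raw, iters=20, eps=1e-8):
    """Project raw logits of shape [..., n, n] toward the doubly stochastic set."""
    # Row-max subtraction keeps exp() from overflowing for large logits.
    m = np.exp(h_raw - h_raw.max(axis=-1, keepdims=True))
    for _ in range(iters):
        m = m / (m.sum(axis=-2, keepdims=True) + eps)   # T_c: each column sums to 1
        m = m / (m.sum(axis=-1, keepdims=True) + eps)   # T_r: each row sums to 1
    return m

rng = np.random.default_rng(0)
h_res = sinkhorn_knopp(rng.normal(size=(4, 4)))
row_err = np.abs(h_res.sum(axis=-1) - 1.0).max()
col_err = np.abs(h_res.sum(axis=-2) - 1.0).max()
print(f"row error {row_err:.1e}, column error {col_err:.1e}")
```

Since the row normalization runs last, rows sum to one up to `eps`, while the column sums carry whatever residual error is left after the fixed number of iterations.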

NVIDIA/Megatron-LM, megatron/core/transformer/hyper_connection.py, lines 19 to 26 at commit 2e55168

One implementation trick is worth noting: activation checkpointing (recomputation). Only the input of the Sinkhorn-Knopp operator is saved, and the iterations are replayed during backward; otherwise autograd would keep the intermediate activations of every iteration alive between forward and backward.

NVIDIA/Megatron-LM, megatron/core/transformer/hyper_connection.py, lines 29 to 54 at commit 2e55168

In backward, it manually re-enables autograd recording and replays the iterations from the saved input to obtain the gradient.

In the DeepSeek-V4 release, the Sinkhorn-Knopp iteration is implemented in TileLang, fused with the split of the pre/post/res weights:

deepseek-ai/DeepSeek-V4-Pro, inference/kernel.py, lines 371 to 427 at commit 44572c

Parallelism

TP: supported. The $nC \to n^2 + 2n$ projection is small and stays replicated; only the sublayer weights are sharded as before.
SP: HC parameters are marked sequence_parallel=True so their gradients participate in the standard SP allreduce.
CP: unaffected, since mHC is per-token.
PP: validated at PP=4 on Qwen3-30B-A3B. DualPipe is extended so the first stream copy $\mathbf{x}_{l_0}$ is cached locally per stage and recompute windows do not cross stage boundaries. The PP boundary carries an asymmetric width: input_expand runs once on PP rank 0 (lifting $C \to nC$) and output_contract runs once on the last rank's final_layernorm (collapsing $nC \to C$ for the LM head), while all intermediate P2P send/recv shapes are $nC$. The schedule's get_tensor_shapes must agree with this layout or P2P will mismatch.

Kernel fusion

PR #3828 is the GB200-targeted follow-up: four cuTile kernels that collapse the per-layer mHC forward and backward into four launches, enabled with --use-fused-mhc.

kernel	fuses	why it matters	speedup
`fused_proj_rms`	RMSNorm and the $nC \to n^2 + 2n$ projection	every layer computes its mappings here; the row-wise sum-of-squares is reused from the matmul accumulator	1.40x
`fused_sinkhorn`	exponential and the 20 alternating row/column normalizations	hides the iteration count behind a single launch; fp32 math with row-max subtraction and tighter $\epsilon{=}10^{-8}$	6.89x
`fused_h_aggregate`	weight broadcast, multiply, and $n$ -stream reduction	memory-bound, but avoids materializing the $[s,b,n,C]$ broadcast	1.13x
`fused_h_post_bda`	residual mixing $\mathbf{H}^{\mathrm{res}}\,\mathbf{r}$ , post-expansion $\mathbf{h}_{\mathrm{post}} \odot (\mathbf{x} + \mathbf{b})$ , and bias	hottest op on the wide residual; keeps the $[n,n]@[n,C]$ mix in registers	3.24x

The PR measures speedups of roughly $35\%$. End-to-end, the paper reports only 6.7% training-time overhead versus baseline dense residuals at 27B with $n=4$, which is what makes the manifold constraint affordable in production.
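The algebra that makes the `fused_proj_rms` row possible is that the RMSNorm scale is a per-row scalar, so it can be applied after the matmul instead of before it. The NumPy sketch below only demonstrates that identity; it says nothing about how the cuTile kernel is written, and the function names, the gain vector `gamma`, and `eps` are assumptions.

```python
import numpy as np

def unfused(x, gamma, w, eps=1e-6):
    # Two passes over x: normalize first, then project.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return ((x / rms) * gamma) @ w

def fused(x, gamma, w, eps=1e-6):
    # One pass over x: accumulate the projection and the sum of squares together,
    # then apply the per-row 1/rms scale to the small [tokens, n^2 + 2n] result.
    acc = x @ (gamma[:, None] * w)            # gain folded into the weight
    sumsq = np.sum(x * x, axis=-1, keepdims=True)
    inv_rms = 1.0 / np.sqrt(sumsq / x.shape[-1] + eps)
    return acc * inv_rms

rng = np.random.default_rng(0)
tokens, n, C = 6, 4, 16
x = rng.normal(size=(tokens, n * C))
gamma = rng.normal(size=(n * C,))
w = rng.normal(size=(n * C, n * n + 2 * n))
print(np.allclose(unfused(x, gamma, w), fused(x, gamma, w)))  # True
```

The scaling moves from the wide $nC$ input to the narrow $n^2 + 2n$ output, so the normalized activation never has to be written out.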

Compute and memory overhead

The extra work that mHC adds per sublayer falls into three buckets; everything else (sigmoids, the 20 Sinkhorn-Knopp iterations) is $O(n^2)$ and negligible at scale.

op	FLOPs
compute mappings (linear $nC \to n^2 + 2n$ )	$2\, sb\, nC\, (n^2 + 2n)$
aggregate ( $n$ -to-1) + expand (1-to- $n$ )	$\approx 3\, sb\, nC$
mix $\mathbf{H}^{\mathrm{res}}\, \mathbf{x}_l$ (batched $n{\times}n \cdot n{\times}C$ )	$2\, sb\, n^2 C$

For the typical $n = 4$ (the value used in the DS-V4 tech report), the per-sublayer total is $\approx 236\, sbC$ FLOPs. Against $8\, sb C^2$ for the attention QKVO projections and $16\, sb C^2$ for a $4\times$ MLP at $C=7168$, that is roughly $0.4\%$ and $0.2\%$ respectively.
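The arithmetic behind these figures is easy to reproduce. The helper below simply evaluates the three table rows per token (dividing out the common $sb$ factor); the function name is hypothetical.

```python
def mhc_flops_per_token(n, C):
    mappings = 2 * n * C * (n * n + 2 * n)   # packed nC -> n^2 + 2n linear
    agg_expand = 3 * n * C                   # aggregate (n-to-1) + expand (1-to-n)
    mix = 2 * n * n * C                      # H_res @ x_l
    return mappings + agg_expand + mix

n, C = 4, 7168
extra = mhc_flops_per_token(n, C)
attn = 8 * C * C                             # attention QKVO projections
mlp = 16 * C * C                             # 4x MLP
print(extra // C)                            # 236
print(f"{extra / attn:.2%} {extra / mlp:.2%}")  # 0.41% 0.21%
```

The mapping projection accounts for $192\, sbC$ of the $236\, sbC$ total, so the packed linear is the only part of the FLOP budget worth optimizing.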

Yet the paper reports 6.7% end-to-end training overhead, well above what the FLOP count predicts. The gap is dominated by activation memory: the residual stream widens from $sbC$ to $sb\,nC$, so every op along it (mix, bias-dropout-add, layernorm input, the residual add itself) runs on an $n\times$ larger tensor.

Parameter overhead is small: each HC module adds $nC (n^2 + 2n)$ weights from the packed $\boldsymbol{\phi}$ projection, plus biases and three scalar gates. For $n = 4, C = 7168$ that is $\sim$688k parameters per sublayer, roughly 0.2% to 0.3% of the sublayer's own GEMM weights ($8C^2$ for a $4\times$ MLP, $4C^2$ for the attention QKVO projections), which is negligible.
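A quick count makes the parameter claim concrete. The bias size of $n^2 + 2n$ is inferred from the shapes of the three mappings and is an assumption of this sketch, as is the function name.

```python
def hc_params(n, C):
    proj = n * C * (n * n + 2 * n)   # packed phi projection
    bias = n * n + 2 * n             # assumed: one bias per mapping entry
    gates = 3                        # alpha_pre, alpha_post, alpha_res
    return proj + bias + gates

n, C = 4, 7168
total = hc_params(n, C)
attn_weights = 4 * C * C             # Q, K, V, O projections
mlp_weights = 8 * C * C              # 4x MLP, up and down projections
print(total)                                                     # 688155
print(f"{total / attn_weights:.2%} {total / mlp_weights:.2%}")   # 0.33% 0.17%
```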

Results

From the paper (dense 27B, matched tokens):

benchmark	baseline	HC	mHC
MMLU (acc.)	59.0	63.0	63.4
BBH (EM)	43.8	48.9	51.0
DROP (F1)	47.0	51.6	53.9
GSM8K (EM)	46.7	53.2	53.8

🌱

Summary: mHC is standard hyper-connections plus manifold projections: a Birkhoff-polytope projection on $\mathbf{H}^{\mathrm{res}}$ via Sinkhorn-Knopp, and sigmoid-enforced non-negativity on $\mathbf{H}^{\mathrm{pre}}$ and $\mathbf{H}^{\mathrm{post}}$. The Megatron integration consists of two PRs: a correctness-first PyTorch path with block-level recompute (#2943), followed by four cuTile-fused kernels (#3828) that amortize the cost of the $n$-times wider residual. The paper reports a 6.7% training-time overhead.

Footnotes

¹ mHC: Manifold-Constrained Hyper-Connections, arXiv:2512.24880
² Deep Residual Learning for Image Recognition, CVPR '16
³ Hyper-Connections, ICLR '25