[paper] mHC: Manifold-Constrained Hyper-Connections

April 20, 2026, last update: May 15, 2026

Based on Manifold-Constrained Hyper-Connections [1] and the Megatron-LM integration tracked in issue #2919, PyTorch-path PR #2943, and cuTile kernel-fusion PR #3828. As expected, DeepSeek-V4, released on 2026/04/24, also uses this hyper-connection structure.

This post focuses on the computation flow and the dedicated optimizations rather than on the theoretical analysis.

Parameterization and manifold projection

Residual connections, introduced in ResNet [2], let each layer learn an additive update $\mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}(\mathbf{x}_l, \mathbf{W}_l)$ instead of a full transformation, which stabilizes gradient flow and enables the training of very deep networks. The single residual stream, however, ties every sublayer to the same width-$C$ representation; HC and mHC generalize this skip to $n$ parallel streams with learnable, possibly manifold-constrained mixing.

Hyper-Connections (HC) [3] widen the residual from a single stream of width $C$ to $n$ parallel streams with learnable mixing:

$$
\mathbf{x}_{l+1} \;=\; \mathbf{H}_l^{\mathrm{res}}\, \mathbf{x}_l \;+\; (\mathbf{H}_l^{\mathrm{post}})^\top\, \mathcal{F}(\mathbf{H}_l^{\mathrm{pre}}\, \mathbf{x}_l,\; \mathbf{W}_l)
$$

with $\mathbf{H}_l^{\mathrm{res}} \in \mathbb{R}^{n \times n}$ and $\mathbf{H}_l^{\mathrm{pre}}, \mathbf{H}_l^{\mathrm{post}} \in \mathbb{R}^{1 \times n}$. mHC parameterizes these three mappings so that each is projected onto a well-behaved manifold, keeping the composite stable across depth. Given the input hidden matrix $\mathbf{x}_l \in \mathbb{R}^{n \times C}$ at layer $l$, the computation proceeds in two phases.

Phase 1: initial mapping computation

The flattened input $\vec{\mathbf{x}}_l = \mathrm{vec}(\mathbf{x}_l) \in \mathbb{R}^{1 \times nC}$ preserves full cross-stream context. After an RMSNorm, the dynamic and static mappings are computed as:

$$
\begin{aligned}
\vec{\mathbf{x}}_l' &= \mathrm{RMSNorm}(\vec{\mathbf{x}}_l) \\
\tilde{\mathbf{H}}_l^{\mathrm{pre}} &= \alpha_l^{\mathrm{pre}} \cdot \big(\vec{\mathbf{x}}_l'\, \boldsymbol{\phi}_l^{\mathrm{pre}}\big) + \mathbf{b}_l^{\mathrm{pre}} \\
\tilde{\mathbf{H}}_l^{\mathrm{post}} &= \alpha_l^{\mathrm{post}} \cdot \big(\vec{\mathbf{x}}_l'\, \boldsymbol{\phi}_l^{\mathrm{post}}\big) + \mathbf{b}_l^{\mathrm{post}} \\
\tilde{\mathbf{H}}_l^{\mathrm{res}} &= \alpha_l^{\mathrm{res}} \cdot \mathrm{mat}\big(\vec{\mathbf{x}}_l'\, \boldsymbol{\phi}_l^{\mathrm{res}}\big) + \mathbf{b}_l^{\mathrm{res}}
\end{aligned}
$$

where $\boldsymbol{\phi}_l^{\mathrm{pre}}, \boldsymbol{\phi}_l^{\mathrm{post}} \in \mathbb{R}^{nC \times n}$ and $\boldsymbol{\phi}_l^{\mathrm{res}} \in \mathbb{R}^{nC \times n^2}$ are learnable linear projections, and $\mathrm{mat}(\cdot)$ reshapes the output from $\mathbb{R}^{1 \times n^2}$ back to $\mathbb{R}^{n \times n}$. In practice the three projections are packed into a single $nC \to n^2 + 2n$ linear. $\boldsymbol{\phi}$, $\mathbf{b}$ and $\alpha$ are all learnable parameters.
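The Phase 1 equations can be sketched for a single token in NumPy. This is only an illustration under assumptions: the RMSNorm has no learnable scale, the packed output is ordered as (pre, post, res), and the function names and initial values are hypothetical, not taken from the Megatron-LM code.

```python
import numpy as np

def rms_norm(v, eps=1e-6):
    # Plain RMSNorm without a learnable scale (an assumption of this sketch).
    return v / np.sqrt(np.mean(v * v, axis=-1, keepdims=True) + eps)

def raw_mappings(x, phi, b_pre, b_post, b_res, a_pre, a_post, a_res):
    """x: [n, C] hidden matrix of one token; phi: [n*C, n*n + 2*n] packed projection."""
    n, _ = x.shape
    x_vec = rms_norm(x.reshape(1, -1))            # flatten to [1, n*C], then normalize
    packed = x_vec @ phi                          # one GEMM for all three mappings
    # Assumed packing order: pre (n), post (n), res (n*n).
    pre, post, res = np.split(packed, [n, 2 * n], axis=-1)
    h_pre = a_pre * pre + b_pre                   # [1, n]
    h_post = a_post * post + b_post               # [1, n]
    h_res = a_res * res.reshape(n, n) + b_res     # mat(.) reshapes [1, n*n] to [n, n]
    return h_pre, h_post, h_res

rng = np.random.default_rng(0)
n, C = 4, 8
x = rng.standard_normal((n, C))
phi = 0.02 * rng.standard_normal((n * C, n * n + 2 * n))
h_pre, h_post, h_res = raw_mappings(
    x, phi,
    b_pre=np.zeros((1, n)), b_post=np.zeros((1, n)), b_res=np.eye(n),
    a_pre=0.01, a_post=0.01, a_res=0.01,
)
print(h_pre.shape, h_post.shape, h_res.shape)  # (1, 4) (1, 4) (4, 4)
```

Packing the three projections into one matrix means a single matmul per token produces all $n^2 + 2n$ raw coefficients, which are then sliced.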

Phase 2: manifold projection

Each raw mapping is then pushed onto its target manifold:

$$
\mathbf{H}_l^{\mathrm{pre}} = \sigma(\tilde{\mathbf{H}}_l^{\mathrm{pre}}), \quad \mathbf{H}_l^{\mathrm{post}} = 2\sigma(\tilde{\mathbf{H}}_l^{\mathrm{post}}), \quad \mathbf{H}_l^{\mathrm{res}} = \mathrm{SK}(\tilde{\mathbf{H}}_l^{\mathrm{res}})
$$

where $\sigma(\cdot)$ is the sigmoid and $\mathrm{SK}(\cdot)$ denotes Sinkhorn-Knopp. The non-negativity enforced on $\mathbf{H}_l^{\mathrm{pre}}$ and $\mathbf{H}_l^{\mathrm{post}}$ prevents signal cancellation from positive-negative coefficient composition, acting as a supplementary manifold projection alongside the Birkhoff-polytope constraint on $\mathbf{H}_l^{\mathrm{res}}$.

Intuition: why can the manifold constraint help numerical stability?

Think of the $n$ residual streams as buckets that carry signal across depth, and each $\mathbf{H}_l^{*}$ as a per-layer rule for how to refill them. Without constraints, that rule can both scale a bucket up and flip its sign, so a tiny perturbation compounds multiplicatively layer over layer, like an interest rate free to be negative or larger than one. The manifold projections cap exactly the two failure modes:

  • The sigmoid on $\mathbf{H}_l^{\mathrm{pre}}$ and $2\sigma$ on $\mathbf{H}_l^{\mathrm{post}}$ enforce non-negativity, so signals can only be attenuated or summed, never sign-cancelled.
  • Sinkhorn-Knopp makes $\mathbf{H}_l^{\mathrm{res}}$ doubly stochastic, so the mix becomes a redistribution of mass across buckets rather than a rescaling of the total.

Together these projections turn the deep-network update into a bounded, sign-preserving, roughly mass-conserving map — the regime where pre-norm residual networks are known to train stably across hundreds of layers.
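The bucket picture can be checked numerically. The sketch below is an illustration, not the paper's experiment: it pushes the same input through 200 residual-mixing steps only (no sublayer output), once with unconstrained Gaussian matrices and once with doubly stochastic matrices. The doubly stochastic matrices are built as convex combinations of random permutation matrices, which is an assumption made for convenience here; mHC obtains them via Sinkhorn-Knopp instead.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 4, 200

def random_doubly_stochastic(rng, n, k=6):
    # Convex combination of k random permutation matrices (Birkhoff-von Neumann):
    # non-negative entries, every row and every column sums to 1.
    weights = rng.dirichlet(np.ones(k))
    return sum(w * np.eye(n)[rng.permutation(n)] for w in weights)

x0 = rng.standard_normal((n, 3))          # n streams, 3 channels
x_free, x_ds = x0.copy(), x0.copy()
for _ in range(depth):
    x_free = rng.standard_normal((n, n)) @ x_free       # unconstrained mixing
    x_ds = random_doubly_stochastic(rng, n) @ x_ds      # Birkhoff-constrained mixing

# Column sums of 1 preserve the per-channel total across streams;
# row sums of 1 with non-negative entries make each output a convex combination.
mass_drift = np.abs(x_ds.sum(axis=0) - x0.sum(axis=0)).max()
print(f"unconstrained max |x|: {np.abs(x_free).max():.3e}")
print(f"constrained   max |x|: {np.abs(x_ds).max():.3e}")
print(f"constrained mass drift: {mass_drift:.3e}")
```

The constrained run can never exceed the largest input magnitude, while the unconstrained product grows geometrically with depth.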

Forward compute workflow

Each transformer layer applies the mHC pattern twice: once around self-attention, once around the MLP. In the per-sublayer flow, $s$ is the number of tokens, $b$ the batch size, $n$ the expansion rate, and $C$ the hidden size.

The block entry replicates the single input stream $n$ times, and the block exit averages the $n$ streams back.
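A minimal NumPy sketch of this flow, following the HC update equation from the first section. The mappings are passed in already projected, and the function name, the `[s, b, n, C]` layout, and the hand-picked mapping values are assumptions for illustration only.

```python
import numpy as np

def mhc_sublayer(x, h_pre, h_post, h_res, f):
    """One mHC-wrapped sublayer.

    x: [s, b, n, C] wide residual; h_pre, h_post: [s, b, n]; h_res: [s, b, n, n];
    f maps [s, b, C] -> [s, b, C] (attention or MLP).
    """
    u = np.einsum("sbn,sbnc->sbc", h_pre, x)          # aggregate: n streams -> 1
    y = f(u)                                          # sublayer compute at width C
    expanded = np.einsum("sbn,sbc->sbnc", h_post, y)  # expand: 1 -> n streams
    mixed = np.einsum("sbij,sbjc->sbic", h_res, x)    # mix the residual streams
    return mixed + expanded

s, b, n, C = 3, 2, 4, 8
rng = np.random.default_rng(0)
tokens = rng.standard_normal((s, b, C))

x = np.repeat(tokens[:, :, None, :], n, axis=2)       # block entry: replicate n times
h_pre = np.full((s, b, n), 1.0 / n)                   # inside the sigmoid range (0, 1)
h_post = np.ones((s, b, n))                           # inside the 2*sigmoid range (0, 2)
h_res = np.broadcast_to(np.eye(n), (s, b, n, n))      # identity is doubly stochastic
x = mhc_sublayer(x, h_pre, h_post, h_res, f=np.tanh)
out = x.mean(axis=2)                                  # block exit: average the streams
```

With these particular values the wide block collapses to an ordinary residual, `out == tokens + tanh(tokens)`, which makes a convenient sanity check for any implementation of the flow.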

Implementation details of Sinkhorn-Knopp

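mHC introduces some extra learnable parameters: the projections $\boldsymbol{\phi}$, the biases $\mathbf{b}$, and the scalar gates $\alpha$ from Phase 1.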

The Sinkhorn-Knopp operator itself enforces double stochasticity through iterative normalization. Starting from the positive matrix $\mathbf{M}^{(0)} = \exp(\tilde{\mathbf{H}}_l^{\mathrm{res}})$, it alternates row and column normalization:

$$
\mathbf{M}^{(t)} = \mathcal{T}_r\big(\mathcal{T}_c(\mathbf{M}^{(t-1)})\big)
$$

where $\mathcal{T}_r$ and $\mathcal{T}_c$ divide each row or column by its sum. The sequence converges to a doubly-stochastic $\mathbf{H}_l^{\mathrm{res}} = \mathbf{M}^{(t_{\max})}$ as $t_{\max} \to \infty$; the implementation uses $t_{\max} = 20$, with row-max subtraction before the exponential for numerical stability.
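The iteration is short enough to write in plain Python. This is a sketch of the recurrence as stated above (column normalization, then row normalization, 20 times), not the fused kernel; the example logits are arbitrary.

```python
import math

def sinkhorn_knopp(logits, t_max=20):
    """Approximately project an n x n matrix of logits onto the doubly stochastic set."""
    # M^(0) = exp(logits), with each row's max subtracted first for numerical stability.
    m = [[math.exp(v - max(row)) for v in row] for row in logits]
    for _ in range(t_max):
        col_sums = [sum(col) for col in zip(*m)]
        m = [[v / c for v, c in zip(row, col_sums)] for row in m]   # T_c: columns
        m = [[v / sum(row) for v in row] for row in m]              # T_r: rows
    return m

logits = [
    [0.2, -0.4, 0.1, 0.0],
    [0.5, 0.3, -0.5, 0.1],
    [-0.2, 0.4, 0.3, -0.3],
    [0.0, -0.1, 0.5, -0.5],
]
h_res = sinkhorn_knopp(logits)
row_err = max(abs(sum(row) - 1.0) for row in h_res)
col_err = max(abs(sum(col) - 1.0) for col in zip(*h_res))
print(f"row error {row_err:.1e}, column error {col_err:.1e}")
```

Because the last step is a row normalization, the rows sum to one up to rounding, and the residual error sits in the columns. Subtracting the row max only rescales each row by a positive constant, which the first row normalization absorbs.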

One implementation trick is worth noting: activation checkpointing (recomputation). Only the input of the Sinkhorn-Knopp operator is saved, and the iterations are replayed during backward. Otherwise autograd would keep the intermediate activations of every iteration alive between forward and backward, which is unnecessary.

In backward, it manually enables the autograd recording and replays the iterations from the saved input to obtain the gradient.
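The recompute pattern can be sketched with a custom `torch.autograd.Function`. This is an illustration of the idea, assuming a plain PyTorch Sinkhorn loop; the class and function names are hypothetical and not the ones used in PR #2943.

```python
import torch

def sinkhorn(logits, t_max=20):
    m = torch.exp(logits - logits.amax(dim=-1, keepdim=True))
    for _ in range(t_max):
        m = m / m.sum(dim=-2, keepdim=True)   # column normalization
        m = m / m.sum(dim=-1, keepdim=True)   # row normalization
    return m

class SinkhornRecompute(torch.autograd.Function):
    @staticmethod
    def forward(ctx, logits, t_max):
        ctx.save_for_backward(logits)         # keep only the operator input
        ctx.t_max = t_max
        return sinkhorn(logits, t_max)        # no graph is recorded inside forward()

    @staticmethod
    def backward(ctx, grad_out):
        (logits,) = ctx.saved_tensors
        with torch.enable_grad():             # re-enable recording for the replay
            x = logits.detach().requires_grad_(True)
            out = sinkhorn(x, ctx.t_max)
        (grad_in,) = torch.autograd.grad(out, x, grad_out)
        return grad_in, None                  # no gradient for t_max

torch.manual_seed(0)
a = torch.randn(4, 4, dtype=torch.float64, requires_grad=True)
b = a.detach().clone().requires_grad_(True)
w = torch.randn(4, 4, dtype=torch.float64)

(SinkhornRecompute.apply(a, 20) * w).sum().backward()   # recompute path
(sinkhorn(b, 20) * w).sum().backward()                  # plain autograd reference
print(torch.allclose(a.grad, b.grad))
```

The trade is one extra forward replay of the 20 iterations in exchange for not holding the per-iteration intermediates between forward and backward.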

In the DeepSeek-V4 release, they use TileLang to implement the Sinkhorn-Knopp iteration together with the pre/post/res weights split:

Parallelism

  • TP: supported. The $nC \to n^2 + 2n$ projection is small and stays replicated; only the sublayer weights are sharded as before.
  • SP: HC parameters are marked sequence_parallel=True so their gradients participate in the standard SP allreduce.
  • CP: unaffected, since mHC is per-token.
  • PP: validated at PP=4 on Qwen3-30B-A3B. DualPipe is extended so the first stream copy $\mathbf{x}_{l_0}$ is cached locally per stage and recompute windows do not cross stage boundaries. The PP boundary carries an asymmetric width: input_expand runs once on PP rank 0 (lifting $C \to nC$) and output_contract runs once on the last rank's final_layernorm (collapsing $nC \to C$ for the LM head), while all intermediate P2P send/recv shapes are $nC$; the schedule's get_tensor_shapes must agree with this layout or P2P will mismatch.
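The asymmetric pipeline boundary can be made explicit with a small shape table. The helper below is hypothetical and only mirrors the description in the PP bullet; it is not the Megatron-LM `get_tensor_shapes` function, and the hidden size is an illustrative value.

```python
def stage_hidden_widths(pp_rank, pp_size, n, C):
    """Return (recv_width, send_width) of the hidden dimension for one pipeline stage.

    None means there is no P2P transfer on that side.
    Hypothetical helper for illustration, not a Megatron-LM API.
    """
    # Rank 0 takes width-C embeddings locally and expands them to n*C before sending.
    recv = None if pp_rank == 0 else n * C
    # The last rank contracts n*C back to C locally for the LM head and sends nothing.
    send = None if pp_rank == pp_size - 1 else n * C
    return recv, send

widths = [stage_hidden_widths(rank, 4, n=4, C=2048) for rank in range(4)]
for rank, (recv, send) in enumerate(widths):
    print(f"rank {rank}: recv={recv} send={send}")
```

Every transfer between stages carries the wide $nC$ residual; the width-$C$ tensors exist only before the expand on the first rank and after the contract on the last.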

Kernel fusion

PR #3828 is the GB200-targeted follow-up: four cuTile kernels that collapse the per-layer mHC forward and backward into four launches, enabled with --use-fused-mhc.

| kernel | fuses | why it matters | speedup |
| --- | --- | --- | --- |
| fused_proj_rms | RMSNorm and the $nC \to n^2 + 2n$ projection | every layer computes its mappings here; the row-wise sum-of-squares is reused from the matmul accumulator | 1.40x |
| fused_sinkhorn | exponential and the 20 alternating row/column normalizations | hides the iteration count behind a single launch; fp32 math with row-max subtraction and tighter $\epsilon = 10^{-8}$ | 6.89x |
| fused_h_aggregate | weight broadcast, multiply, and $n$-stream reduction | memory-bound, but avoids materializing the $[s,b,n,C]$ broadcast | 1.13x |
| fused_h_post_bda | residual mixing $\mathbf{H}^{\mathrm{res}}\,\mathbf{r}$, post-expansion $\mathbf{h}_{\mathrm{post}} \odot (\mathbf{x} + \mathbf{b})$, and bias | hottest op on the wide residual; keeps the $[n,n]@[n,C]$ mix in registers | 3.24x |

The PR reports speedups of roughly 35%. End-to-end, the paper reports only 6.7% training-time overhead versus baseline dense residuals at 27B with $n=4$, which is what makes the manifold constraint affordable in production.

Compute and memory overhead

The extra work that mHC adds per sublayer falls into three buckets; everything else (sigmoids, the 20 Sinkhorn-Knopp iterations) is $O(n^2)$ and negligible at scale.

| op | FLOPs |
| --- | --- |
| compute mappings (linear $nC \to n^2 + 2n$) | $2\, sb\, nC\, (n^2 + 2n)$ |
| aggregate ($n$-to-1) + expand (1-to-$n$) | $\approx 3\, sb\, nC$ |
| mix $\mathbf{H}^{\mathrm{res}}\, \mathbf{x}_l$ (batched $n{\times}n \cdot n{\times}C$) | $2\, sb\, n^2 C$ |

For a typical $n = 4$ (as in the DS-V4 tech report), the per-sublayer total is $\approx 236\, sbC$ FLOPs. Against $8\, sb C^2$ for attention QKVO and $16\, sb C^2$ for a $4\times$ MLP at $C=7168$, the ratio is roughly $0.4\%$ and $0.2\%$ respectively.
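The arithmetic behind these figures, written out per token (the table entries divided by $sb$). The function name is just for illustration.

```python
def mhc_flops_per_token(n, C):
    """Per-token, per-sublayer mHC FLOPs from the three rows of the table."""
    mappings = 2 * n * C * (n * n + 2 * n)   # packed nC -> n^2 + 2n projection
    agg_expand = 3 * n * C                   # n-to-1 aggregate plus 1-to-n expand
    res_mix = 2 * n * n * C                  # [n, n] @ [n, C] residual mix
    return mappings + agg_expand + res_mix

n, C = 4, 7168
total = mhc_flops_per_token(n, C)
coeff = total // C                           # multiplier of s*b*C: 192 + 12 + 32
attn_ratio = total / (8 * C * C)             # versus attention QKVO projections
mlp_ratio = total / (16 * C * C)             # versus a 4x MLP
print(coeff, f"{attn_ratio:.2%}", f"{mlp_ratio:.2%}")  # 236 0.41% 0.21%
```

The mapping projection accounts for 192 of the 236, so the packed linear dominates the added FLOPs even though it is tiny next to the sublayer GEMMs.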

Yet the paper reports 6.7% end-to-end training overhead, well above what the FLOP count predicts. The gap is dominated by activation memory: the residual stream widens from $sbC$ to $sb\,nC$, so every op along it (mix, bias-dropout-add, layernorm input, the residual add itself) runs on an $n\times$ larger tensor.

Parameter overhead is small: each HC module adds $nC (n^2 + 2n)$ weights from the packed $\boldsymbol{\phi}$ projection, plus biases and three scalar gates. For $n = 4, C = 7168$ that is $\sim$688k parameters per sublayer, on the order of 0.1% of the sublayer's own GEMM weights, which is negligible.
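The parameter count can be reproduced directly. The comparison baselines are assumptions of this sketch: four $C \times C$ attention projections (no grouped-query sharing) and a two-matrix $4\times$ MLP.

```python
def mhc_params(n, C):
    phi = n * C * (n * n + 2 * n)    # packed projection
    biases = n * n + 2 * n           # b_pre (n), b_post (n), b_res (n*n)
    gates = 3                        # alpha_pre, alpha_post, alpha_res
    return phi + biases + gates

n, C = 4, 7168
extra = mhc_params(n, C)
attn_weights = 4 * C * C             # Q, K, V, O (assumed dense, no GQA)
mlp_weights = 8 * C * C              # two C x 4C matrices (assumed non-gated MLP)
print(extra, f"{extra / attn_weights:.2%}", f"{extra / mlp_weights:.2%}")
```

Under these assumptions the extra parameters stay at a few tenths of a percent of either sublayer, consistent with the order of magnitude quoted above.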

Results

From the paper (dense 27B, matched tokens):

| benchmark | baseline | HC | mHC |
| --- | --- | --- | --- |
| MMLU (acc.) | 59.0 | 63.0 | 63.4 |
| BBH (EM) | 43.8 | 48.9 | 51.0 |
| DROP (F1) | 47.0 | 51.6 | 53.9 |
| GSM8K (EM) | 46.7 | 53.2 | 53.8 |

Summary: mHC equals standard hyper-connections plus a Birkhoff-polytope projection on $\mathbf{H}^{\mathrm{res}}$ via Sinkhorn-Knopp. The Megatron integration consists of two PRs: a correctness-first PyTorch path with block-level recompute (#2943), followed by four cuTile-fused kernels (#3828) that target the cost of the $n$-times wider residual, which the paper puts at roughly 6.7% training-time overhead.

Footnotes

  1. mHC: Manifold-Constrained Hyper-Connections, arXiv:2512.24880

  2. Deep Residual Learning for Image Recognition, CVPR '16

  3. Hyper-Connections, ICLR '25