This post is based on Manifold-Constrained Hyper-Connections (mHC) [1] and its Megatron-LM integration, tracked in issue #2919, PyTorch-path PR #2943, and cuTile kernel-fusion PR #3828. DeepSeek-V4, released today (2026/04/24), also uses this hyper-connection structure, as expected.
The focus here is on the computation flow and dedicated optimizations rather than on the theoretical analysis.
Parameterization and manifold projection
Residual connections, introduced in ResNet [2], let each layer learn an additive update instead of a full transformation, which stabilizes gradient flow and enables the training of very deep networks. The single residual stream, however, ties every sublayer to the same width-$C$ representation; HC and mHC generalize this skip connection to $n$ parallel streams with learnable, possibly manifold-constrained mixing.
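Hyper-Connections (HC) [3] widen the residual from a single stream of width $C$ to $n$ parallel streams with learnable mixing:

$$x_{l+1} = H_l^{\mathrm{res}}\, x_l + {H_l^{\mathrm{post}}}^{\top} \mathcal{F}\left(H_l^{\mathrm{pre}}\, x_l\right),$$

with $H_l^{\mathrm{res}} \in \mathbb{R}^{n \times n}$ and $H_l^{\mathrm{pre}}, H_l^{\mathrm{post}} \in \mathbb{R}^{1 \times n}$, where $\mathcal{F}$ is the wrapped sublayer. mHC parameterizes these three mappings so that each is projected onto a well-behaved manifold, keeping the composite mapping stable across depth. Given the input hidden matrix $x_l \in \mathbb{R}^{n \times C}$ at layer $l$, the computation proceeds in two phases.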
Phase 1: initial mapping computation
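The flattened input $\vec{x}_l = \mathrm{vec}(x_l) \in \mathbb{R}^{1 \times nC}$ preserves full cross-stream context. After an RMSNorm, the dynamic and static mappings are computed as:

$$
\begin{aligned}
\vec{x}'_l &= \mathrm{RMSNorm}(\vec{x}_l), \\
\tilde{H}_l^{\mathrm{pre}} &= \alpha_l^{\mathrm{pre}} \left(\vec{x}'_l\, \varphi_l^{\mathrm{pre}}\right) + b_l^{\mathrm{pre}}, \\
\tilde{H}_l^{\mathrm{post}} &= \alpha_l^{\mathrm{post}} \left(\vec{x}'_l\, \varphi_l^{\mathrm{post}}\right) + b_l^{\mathrm{post}}, \\
\tilde{H}_l^{\mathrm{res}} &= \alpha_l^{\mathrm{res}}\, \mathrm{mat}\left(\vec{x}'_l\, \varphi_l^{\mathrm{res}}\right) + b_l^{\mathrm{res}},
\end{aligned}
$$

where $\varphi_l^{\mathrm{pre}}, \varphi_l^{\mathrm{post}} \in \mathbb{R}^{nC \times n}$ and $\varphi_l^{\mathrm{res}} \in \mathbb{R}^{nC \times n^2}$ are learnable linear projections, and $\mathrm{mat}(\cdot)$ reshapes the output from $\mathbb{R}^{1 \times n^2}$ back to $\mathbb{R}^{n \times n}$. In practice the three projections are packed into a single linear layer. The scalar gates $\alpha_l$ and the static biases $b_l$ are learnable parameters as well.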
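Phase 1 can be sketched in a few lines of PyTorch. This is a minimal illustration rather than the Megatron-LM code: the function name `raw_mappings`, the `[T, n, C]` layout with `T` flattened tokens, the packed column order (pre, post, res), and the RMSNorm without a learnable scale are all assumptions made for the example.

```python
import torch

def raw_mappings(x, phi, alpha, bias, eps=1e-6):
    """Hypothetical Phase 1 helper.

    x:     [T, n, C] residual streams
    phi:   [n*C, n*n + 2*n] packed projection (assumed column order: pre, post, res)
    alpha: [3] scalar gates (pre, post, res)
    bias:  [n*n + 2*n] static part of the mappings
    """
    T, n, C = x.shape
    flat = x.reshape(T, n * C)                        # flatten all streams per token
    rms = torch.rsqrt(flat.pow(2).mean(dim=-1, keepdim=True) + eps)
    flat = flat * rms                                 # RMSNorm (no learnable scale here)
    out = flat @ phi                                  # one packed linear for all three maps
    pre, post, res = out.split([n, n, n * n], dim=-1)
    b_pre, b_post, b_res = bias.split([n, n, n * n])
    h_pre = alpha[0] * pre + b_pre                    # [T, n]
    h_post = alpha[1] * post + b_post                 # [T, n]
    h_res = (alpha[2] * res + b_res).reshape(T, n, n) # [T, n, n]
    return h_pre, h_post, h_res

T, n, C = 5, 4, 8
x = torch.randn(T, n, C)
phi = torch.randn(n * C, n * n + 2 * n) * 0.02
alpha = torch.full((3,), 0.01)
bias = torch.zeros(n * n + 2 * n)
h_pre, h_post, h_res = raw_mappings(x, phi, alpha, bias)
```

Packing the three projections into one matrix means a single GEMM per sublayer, and the split afterwards is free because it only creates views.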
Phase 2: manifold projection
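Each raw mapping $\tilde{H}$ is then pushed onto its target manifold:

$$
H_l^{\mathrm{pre}} = \sigma\left(\tilde{H}_l^{\mathrm{pre}}\right), \qquad
H_l^{\mathrm{post}} = 2\,\sigma\left(\tilde{H}_l^{\mathrm{post}}\right), \qquad
H_l^{\mathrm{res}} = \mathrm{SK}\left(\tilde{H}_l^{\mathrm{res}}\right),
$$

where $\sigma$ is the sigmoid and $\mathrm{SK}(\cdot)$ denotes Sinkhorn-Knopp. The non-negativity enforced on $H_l^{\mathrm{pre}}$ and $H_l^{\mathrm{post}}$ prevents signal cancellation from positive-negative coefficient composition, acting as a supplementary manifold projection alongside the Birkhoff-polytope constraint on $H_l^{\mathrm{res}}$.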
Intuition: why can a manifold constraint help numerical stability?
Think of the $n$ residual streams as $n$ buckets that carry signal across depth, and each $H_l^{\mathrm{res}}$ as a per-layer rule for how to refill them. Without constraints, that rule can both scale a bucket up and flip its sign, so a tiny perturbation compounds multiplicatively layer over layer, like an interest rate free to be negative or larger than one. The manifold projections cap exactly the two failure modes:
- The sigmoids on the pre and post mappings $H^{\mathrm{pre}}$ and $H^{\mathrm{post}}$ enforce non-negativity, so signals can only be attenuated or summed, never sign-cancelled.
- Sinkhorn-Knopp makes $H^{\mathrm{res}}$ doubly stochastic, so the mix becomes a redistribution of mass across buckets rather than a rescaling of the total.
Together these projections turn the deep-network update into a bounded, sign-preserving, roughly mass-conserving map — the regime where pre-norm residual networks are known to train stably across hundreds of layers.
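A small numerical experiment illustrates the bucket picture. It is a deliberately exaggerated sketch, not a model of real training: the unconstrained mixing matrices are plain Gaussian draws, the doubly stochastic ones are built as convex combinations of permutation matrices (points of the Birkhoff polytope), and the helper `random_doubly_stochastic` is hypothetical.

```python
import torch

def random_doubly_stochastic(n, k, gen):
    """Convex combination of k random permutation matrices (hypothetical helper)."""
    w = torch.rand(k, generator=gen, dtype=torch.float64)
    w = w / w.sum()                                   # convex weights
    eye = torch.eye(n, dtype=torch.float64)
    perms = torch.stack([eye[torch.randperm(n, generator=gen)] for _ in range(k)])
    return (w[:, None, None] * perms).sum(dim=0)      # rows and columns each sum to 1

gen = torch.Generator().manual_seed(0)
n, depth = 4, 64
free = torch.eye(n, dtype=torch.float64)
constrained = torch.eye(n, dtype=torch.float64)
for _ in range(depth):
    # Compose one more layer of stream mixing onto each running product.
    free = torch.randn(n, n, generator=gen, dtype=torch.float64) @ free
    constrained = random_doubly_stochastic(n, 3, gen) @ constrained

free_gain = torch.linalg.matrix_norm(free, ord=2).item()
constrained_gain = torch.linalg.matrix_norm(constrained, ord=2).item()
print(f"unconstrained gain: {free_gain:.3e}, doubly stochastic gain: {constrained_gain:.6f}")
```

The product of doubly stochastic matrices is again doubly stochastic, so its spectral norm stays at 1 regardless of depth, while the unconstrained product is free to grow or shrink exponentially.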
Forward compute workflow
Each transformer layer applies the mHC pattern twice: once around self-attention, once around the MLP. The per-sublayer flow, with $s$ tokens, batch size $b$, expansion rate $n$, and hidden size $C$, is:
The block entry replicates the single input stream $n$ times, and the block exit averages the $n$ streams back.
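The per-sublayer flow and the block entry and exit can be sketched as follows. This is a minimal illustration under assumed conventions: tokens and batch are flattened into one leading dimension `T`, the mappings arrive already projected, and the names `expand_streams`, `contract_streams`, and `mhc_sublayer` are hypothetical rather than taken from the Megatron-LM PRs.

```python
import torch

def expand_streams(x, n):
    """Block entry: replicate [T, C] into n identical streams -> [T, n, C]."""
    return x.unsqueeze(1).expand(-1, n, -1).contiguous()

def contract_streams(x):
    """Block exit: average the n streams back to [T, C]."""
    return x.mean(dim=1)

def mhc_sublayer(x, h_pre, h_post, h_res, sublayer):
    """x: [T, n, C]; h_pre, h_post: [T, n]; h_res: [T, n, n] (already projected)."""
    y = torch.einsum("tn,tnc->tc", h_pre, x)            # aggregate: n streams -> 1
    y = sublayer(y)                                     # attention or MLP on [T, C]
    mixed = torch.einsum("tmn,tnc->tmc", h_res, x)      # mix the residual streams
    return mixed + h_post.unsqueeze(-1) * y.unsqueeze(1)  # expand 1 -> n and add

T, n, C = 6, 4, 8
x0 = torch.randn(T, C)
x = torch.randn(T, n, C)
eye = torch.eye(n)
# A simple doubly stochastic mix: half identity, half cyclic shift.
h_res = (0.5 * eye + 0.5 * eye.roll(1, dims=0)).expand(T, n, n)
h_pre = torch.full((T, n), 1.0 / n)
h_post = torch.ones(T, n)
layer = torch.nn.Linear(C, C)
out = mhc_sublayer(x, h_pre, h_post, h_res, layer)      # [T, n, C]
```

Because the columns of a doubly stochastic `h_res` sum to one, the mix step leaves the sum over streams unchanged; only the sublayer output, scaled by `h_post`, adds new signal.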
Implementation details of Sinkhorn-Knopp
mHC introduces a few extra learnable parameters per sublayer: the packed projection, the static biases, and three scalar gates.
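The Sinkhorn-Knopp operator itself enforces double stochasticity through iterative normalization. Starting from the positive matrix $M^{(0)} = \exp\left(\tilde{H}^{\mathrm{res}}\right)$, it alternates row and column normalization:

$$M^{(t)} = \mathcal{T}_r\left(\mathcal{T}_c\left(M^{(t-1)}\right)\right),$$

where $\mathcal{T}_r$ and $\mathcal{T}_c$ divide each row or column by its sum. The sequence converges to a doubly-stochastic $H^{\mathrm{res}} = M^{(t_{\max})}$ as $t_{\max} \to \infty$; the implementation uses $t_{\max} = 20$, with row-max subtraction before the exponential for numerical stability.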
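A plain PyTorch version of the iteration might look like the sketch below. The function name `sinkhorn_knopp`, the `eps` guard, and the column-then-row order inside the loop are assumptions for illustration; the fused kernels discussed later are not reproduced here.

```python
import torch

def sinkhorn_knopp(logits, n_iters=20, eps=1e-8):
    """logits: [..., n, n] raw mapping; returns an approximately doubly stochastic matrix."""
    logits = logits.float()                           # keep the iteration in fp32
    # Row-max subtraction only rescales each row of exp(logits); the
    # doubly stochastic limit of the iteration is invariant to that scaling.
    m = torch.exp(logits - logits.amax(dim=-1, keepdim=True))
    for _ in range(n_iters):
        m = m / (m.sum(dim=-2, keepdim=True) + eps)   # column normalization
        m = m / (m.sum(dim=-1, keepdim=True) + eps)   # row normalization
    return m

torch.manual_seed(0)
logits = 0.5 * torch.randn(8, 4, 4)                   # 8 tokens, n = 4 streams
h_res = sinkhorn_knopp(logits)
row_err = (h_res.sum(dim=-1) - 1).abs().max().item()
col_err = (h_res.sum(dim=-2) - 1).abs().max().item()
print(f"max row error {row_err:.2e}, max column error {col_err:.2e}")
```

Since the row step runs last, rows sum to one almost exactly, and the column error is what the remaining iterations drive down.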
One notable implementation trick is activation recomputation. Only the input of the Sinkhorn-Knopp operator is saved in forward, and the iteration is replayed during backward; otherwise autograd would keep the intermediate activations of every normalization step alive between forward and backward.
In backward, the operator manually re-enables autograd recording, replays the iteration from the saved input, and backpropagates through the replayed graph.
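The recompute pattern can be sketched with a custom `torch.autograd.Function`. This is an illustrative sketch, not the code from the PR: the class name `RecomputedSinkhorn` and the compact `sinkhorn` helper are hypothetical, and only the general save-input-then-replay idea is taken from the description above.

```python
import torch

def sinkhorn(logits, n_iters=20):
    """Compact Sinkhorn-Knopp iteration used only for this sketch."""
    m = torch.exp(logits - logits.amax(dim=-1, keepdim=True))
    for _ in range(n_iters):
        m = m / m.sum(dim=-2, keepdim=True)            # column normalization
        m = m / m.sum(dim=-1, keepdim=True)            # row normalization
    return m

class RecomputedSinkhorn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, logits, n_iters):
        ctx.save_for_backward(logits)                  # keep only the operator input
        ctx.n_iters = n_iters
        return sinkhorn(logits, n_iters)               # forward() runs without recording

    @staticmethod
    def backward(ctx, grad_out):
        (logits,) = ctx.saved_tensors
        with torch.enable_grad():                      # manually re-enable recording
            replay_in = logits.detach().requires_grad_(True)
            replay_out = sinkhorn(replay_in, ctx.n_iters)
        (grad_in,) = torch.autograd.grad(replay_out, replay_in, grad_out)
        return grad_in, None                           # no gradient for n_iters

torch.manual_seed(0)
logits = torch.randn(3, 4, 4, dtype=torch.float64, requires_grad=True)
weight = torch.randn(3, 4, 4, dtype=torch.float64)

(RecomputedSinkhorn.apply(logits, 20) * weight).sum().backward()
grad_recomputed = logits.grad.clone()

logits.grad = None
(sinkhorn(logits, 20) * weight).sum().backward()       # plain autograd reference
grad_reference = logits.grad.clone()
```

The trade is one extra forward pass of a tiny $n \times n$ iteration in exchange for not storing the intermediate matrices of every normalization step per token.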
In the DeepSeek-V4 release, they use TileLang to implement the Sinkhorn-Knopp iteration together with the pre/post/res weights split:
Parallelism
- TP: supported. The projection is small and stays replicated; only the sublayer weights are sharded as before.
- SP: HC parameters are marked `sequence_parallel=True` so their gradients participate in the standard SP allreduce.
- CP: unaffected, since mHC is per-token.
- PP: validated at PP=4 on Qwen3-30B-A3B. DualPipe is extended so the first stream copy is cached locally per stage and recompute windows do not cross stage boundaries.
The PP boundary carries an asymmetric width: `input_expand` runs once on PP rank 0 (lifting the single stream to $n$ streams) and `output_contract` runs once at the last rank's `final_layernorm` (collapsing the $n$ streams back to one for the LM head), while all intermediate P2P send/recv tensors carry the widened $n$-stream shape. The schedule's `get_tensor_shapes` must agree with this layout or P2P will mismatch.
Kernel fusion
PR #3828 is the GB200-targeted follow-up: four cuTile kernels that collapse the per-layer mHC forward and backward into four launches, enabled with --use-fused-mhc.
| kernel | fuses | why it matters | speedup |
|---|---|---|---|
| `fused_proj_rms` | RMSNorm and the projection | every layer computes its mappings here; the row-wise sum-of-squares is reused from the matmul accumulator | 1.40x |
| `fused_sinkhorn` | exponential and the 20 alternating row/column normalizations | hides the iteration count behind a single launch; fp32 math with row-max subtraction and tighter | 6.89x |
| `fused_h_aggregate` | weight broadcast, multiply, and $n$-stream reduction | memory-bound, but avoids materializing the broadcast | 1.13x |
| `fused_h_post_bda` | residual mixing, post-expansion, and bias | hottest op on the wide residual; keeps the mix in registers | 3.24x |
The speedups are rough figures measured in the PR. End-to-end, the paper reports only 6.7% training-time overhead versus baseline dense residuals at 27B with $n = 4$, which is what makes the manifold constraint affordable in production.
Compute and memory overhead
The extra work that mHC adds per sublayer falls into three buckets; everything else (sigmoids, the 20 Sinkhorn-Knopp iterations) is $O(n^2)$ per token and negligible at scale.
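| op | FLOPs per token |
|---|---|
| compute mappings (linear $nC \to n^2 + 2n$) | $2nC(n^2 + 2n)$ |
| aggregate ($n$-to-1) + expand (1-to-$n$) | $4nC$ |
| mix (batched $n \times n$ by $n \times C$ matmul) | $2n^2C$ |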
For typical (in the DS-V4 tech report), per sublayer the total is FLOPs. Against for attention QKVO and for a MLP at , the ratio is roughly and respectively.
Yet the paper reports 6.7% end-to-end training overhead, well above what the FLOP count predicts. The gap is dominated by activation memory: the residual stream widens from $C$ to $nC$, so every op along it (mix, bias-dropout-add, layernorm input, the residual add itself) runs on an $n\times$ larger tensor.
Parameter overhead is small: each HC module adds $nC(n^2 + 2n)$ weights from the packed projection plus $n^2 + 2n$ biases and three scalar gates. For $n = 4$ and $C = 7168$ that is about 688k parameters per sublayer, on the order of 0.1% of the sublayer's own GEMM weights, which is negligible.
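The arithmetic behind these overhead estimates is easy to check by hand. The sketch below assumes $n = 4$ and $C = 7168$ (the combination that reproduces the 688k figure), counts a multiply-add as 2 FLOPs, and works per token; the function names are hypothetical.

```python
def mhc_param_count(n, C):
    """Packed projection weights + biases + three scalar gates, per sublayer."""
    out_dim = n * n + 2 * n               # res (n*n) + pre (n) + post (n)
    return n * C * out_dim + out_dim + 3

def mhc_flops_per_token(n, C):
    """Extra FLOPs per token per sublayer, counting a multiply-add as 2 FLOPs."""
    out_dim = n * n + 2 * n
    mappings = 2 * (n * C) * out_dim      # packed linear nC -> n*n + 2n
    aggregate = 2 * n * C                 # weighted n-to-1 reduction
    expand = 2 * n * C                    # 1-to-n scaling plus the residual add
    mix = 2 * n * n * C                   # (n x n) @ (n x C)
    return mappings + aggregate + expand + mix

n, C = 4, 7168                            # assumed values, see the lead-in
params = mhc_param_count(n, C)
flops = mhc_flops_per_token(n, C)
print(f"params per sublayer: {params:,}")
print(f"extra FLOPs per token per sublayer: {flops:,}")
```

Under these assumptions the mapping projection accounts for most of the extra FLOPs, while the mix and the aggregate and expand steps matter more for memory traffic than for arithmetic.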
Results
From the paper (dense 27B, matched tokens):
| benchmark | baseline | HC | mHC |
|---|---|---|---|
| MMLU (acc.) | 59.0 | 63.0 | 63.4 |
| BBH (EM) | 43.8 | 48.9 | 51.0 |
| DROP (F1) | 47.0 | 51.6 | 53.9 |
| GSM8K (EM) | 46.7 | 53.2 | 53.8 |
Summary: mHC equals standard hyper-connections plus a Birkhoff-polytope projection on $H^{\mathrm{res}}$ via Sinkhorn-Knopp. The Megatron integration consists of two PRs: a correctness-first PyTorch path with block-level recompute (#2943), followed by four cuTile-fused kernels (#3828) that amortize the $n$-times residual width at roughly 6% to 7% training-time cost.